The objective of my thesis is to determine empirically whether geographical pollution spillovers arose from the 2018 implementation of the Madrid Central low emission zone (LEZ) and, if so, to measure their magnitude.
In this markdown file, I present the code used to reach the analyses and conclusions summarised in the dissertation “Beyond the Zone: Assessing Spillover Effects of Madrid Central on Air Pollution” (hereinafter “the main document”).
The report is structured as follows. Section 1 includes the pre-processing of the data (both pollution measures and information about pollution monitoring stations). Section 2 performs a descriptive analysis of pollution measurements by station, location and year, as well as of the location of stations with respect to the LEZ borders and the perimeter of the Community of Madrid. Section 3 presents visually the identification strategy. Section 4 introduces our model control variables, involving data wrangling, imputation of missing values and descriptive analysis by means of visualizations. Section 5 contains a series of Difference-in-Differences (DiD) models used to answer our research question, which follow some pre-modelling feature engineering and assumption-checking plots. Finally, Sections 6 and 7 encompass robustness checks and the appendix, respectively.
We begin by loading the necessary libraries for our analysis. Some of them are used for data manipulation and visualization, such as tidyverse, ggplot2, ggdag, lubridate, forcats, zoo and here. Others are used for geospatial analysis, mapping and visualization, namely geosphere, sf, mapview, ggmap and maptools; climaemet provides access to Spanish meteorological (AEMET) data. Additionally, we need libraries for statistical analysis, in our case plm, lmtest, estimatr, fixest, mice and miceadds. Lastly, we load libraries for formatting and output like stargazer, cowplot, corrplot, knitr and kableExtra. We use vistime and highcharter to display the policy timeline in the Appendix.
library(tidyverse)
library(ggplot2)
library(here)
library(zoo)
library(ggdag)
library(lubridate)
library(forcats)
library(geosphere)
library(climaemet)
library(sf)
library(mapview)
library(ggmap)
library(maptools)
library(plm)
library(lmtest)
library(estimatr)
library(fixest)
library(mice)
library(miceadds)
library(cowplot)
library(corrplot)
library(stargazer)
library(knitr)
library(kableExtra)
library(vistime)
library(highcharter)
Next, we run the following line to automatically set the working directory to the root directory of the project. This way, our code will work regardless of the current working directory or the location of this file.
# Set working directory
setwd(here::here())
We download hourly pollution data from the data portal of the City Council for 24 different pollution monitoring stations in the City of Madrid for all months between January 2015 and December 2022 (data is available since 2001).
We first create a loop to read pollution data for all years and clean the data to get to a dataframe called ‘all_city_data_clean’. Important steps in this process include renaming and selecting relevant columns, transforming the data from wide to long format, filtering the dataframe for NO2 (nitrogen dioxide, measured in micrograms per cubic metre or µg/m3) measures and adjusting variable types. We then aggregate the data to the monthly level (‘city_monthly’).
We are also interested in checking whether any stations have fewer than 18 observations per day or fewer than 270 observations per year (around 75% of the possible total in each case). We find that only the pollution monitoring station in Arturo Soria (station code 16) has fewer than 270 observations per year (264). However, this is only the case for 2022, and we will not use observations from that year for reasons introduced later. Hence, we do not remove the data yet, to ensure comparability with the Community dataset (Section 1.2).
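As a quick arithmetic check of these completeness thresholds (a minimal sketch, not part of the original pipeline): 18 of 24 hourly observations and 270 of 365 daily observations both correspond to roughly 75% coverage.

```r
# Completeness thresholds as shares of the possible totals
hour_share <- 18 / 24    # share of hours retained per day
day_share  <- 270 / 365  # share of days retained per year
round(c(hour_share, day_share), 2)
```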
# File names of hourly data for all months in all years (2015-2022)
months_esp <- c("ene", "feb", "mar", "abr", "may", "jun",
                "jul", "ago", "sep", "oct", "nov", "dic")
file_names <- paste0(rep(months_esp, times = 8), "_mo",
                     rep(15:22, each = 12), ".csv")
# Create an empty dataframe to store the cleaned data
all_city_data_clean <- data.frame()
# Loop through the file names and apply the same transformations to each dataset
for (file_name in file_names) {
# Read the data
city_data <- read_delim(file_name, delim=";")
# Apply transformations
city_data_clean <- city_data %>%
select(-c(PROVINCIA, PUNTO_MUESTREO)) %>%
rename(town_code = MUNICIPIO,
station_code = ESTACION,
magnitude = MAGNITUD,
year = ANO,
month = MES,
day = DIA) %>%
select(-matches("^V")) %>%
pivot_longer(cols = starts_with("H"), names_to = "hour", values_to = "pollution_level") %>%
# Keep only NO2 measures and remove 'magnitude' column
filter(magnitude == "8") %>%
select(-magnitude) %>%
# Adjust variable types
mutate(month = as.numeric(month),
day = as.numeric(day),
hour = as.numeric(str_replace(hour, "H", "")),
# Remove leading zeros from 'code' variables
town_code = as.factor(str_remove(town_code, "^0+")),
station_code = as.factor(str_remove(station_code, "^0+")),
# Replace comma with point for 'pollution_level'
pollution_level = as.numeric(gsub(",", ".", pollution_level)))
# Append the cleaned data to the 'all_city_data_clean' dataframe
all_city_data_clean <- bind_rows(all_city_data_clean, city_data_clean)
}
# Check for stations with fewer than 18 observations per day or 270 observations per year
stations_to_exclude <- all_city_data_clean %>%
group_by(town_code, station_code, year, month, day) %>%
summarise(n_obs = n()) %>%
group_by(town_code, station_code, year) %>%
summarise(n_obs_per_year = n(),
n_obs_per_day = mean(n_obs)) %>%
filter(n_obs_per_day < 18 | n_obs_per_year < 270)
print(stations_to_exclude)
## # A tibble: 1 × 5
## # Groups: town_code, station_code [1]
## town_code station_code year n_obs_per_year n_obs_per_day
## <fct> <fct> <dbl> <int> <dbl>
## 1 79 16 2022 264 24
# Aggregate the data to monthly level
city_monthly <- all_city_data_clean %>%
group_by(town_code, station_code, year, month) %>%
summarise(pollution_level = mean(pollution_level))
We employ a similar loop-based strategy to read pollution data from monitoring stations outside the City of Madrid from 2015 to 2022. Hourly data for the 24 stations in the rest of the Community of Madrid (i.e. excluding the Capital), available since 2005, can be found here. We apply virtually identical cleaning and aggregation steps to create our dataframe of monthly observations.
This time, we find no stations with incomplete measurements when running the lines that create the object ‘stations_to_exclude’.
# Create file names vector
file_names <- c(paste0("com_", c(2015:2022), ".csv"))
# Create an empty dataframe to store the cleaned data
all_com_data_clean <- data.frame()
# Loop through the file names and apply the same transformations to each dataset
for (file_name in file_names) {
# Read the data
com_data <- read_delim(file_name, delim=";")
# Apply transformations
com_data_clean <- com_data %>%
dplyr::select(-c(provincia, punto_muestreo)) %>%
rename(town_code = municipio,
station_code = estacion,
magnitude = magnitud,
year = ano,
month = mes,
day = dia) %>%
select(-matches("^V")) %>%
pivot_longer(cols = starts_with("h"), names_to = "hour", values_to = "pollution_level") %>%
# Keep only NO2 measures and remove 'magnitude' column
filter(magnitude == "8") %>%
select(-magnitude) %>%
# Adjust variable types, remove leading zeros and replace commas with points
mutate(month = as.numeric(month),
day = as.numeric(day),
hour = as.numeric(str_replace(hour, "h", "")),
town_code = as.factor(str_remove(town_code, "^0+")),
station_code = as.factor(str_remove(station_code, "^0+")),
pollution_level = as.numeric(gsub(",", ".", pollution_level)))
# Append the cleaned data to the 'all_com_data_clean' dataframe and drop NA
all_com_data_clean <- bind_rows(all_com_data_clean, com_data_clean) %>%
drop_na(pollution_level)
}
# Check for stations with fewer than 18 observations per day or 270 observations per year
stations_to_exclude <- all_com_data_clean %>%
group_by(town_code, station_code, year, month, day) %>%
summarise(n_obs = n()) %>%
group_by(station_code, year) %>%
summarise(n_obs_per_year = n(),
n_obs_per_day = mean(n_obs)) %>%
filter(n_obs_per_day < 18 | n_obs_per_year < 270) %>%
pull(station_code)
print(stations_to_exclude)
## factor()
## Levels: 1 14 2 3 4 5 7
# Aggregate the data to monthly level
com_monthly <- all_com_data_clean %>%
group_by(town_code, station_code, year, month) %>%
summarise(pollution_level = mean(pollution_level))
Complete dataset of monthly pollution observations for the air quality network of the Community of Madrid
We now combine the monthly pollution data from the 24 stations inside the City of Madrid and the 24 stations outside the City and create a dataframe called ‘full_data’ with the following columns: ‘town_code’ (see correspondence with municipality names in Sections 1.3 and 1.4), ‘station_code’, ‘year’, ‘month’ and ‘pollution_level’ (measured by NO2 concentration).
# Bind together observations of both datasets
full_data <- bind_rows(city_monthly, com_monthly)
We now download data pertaining to the pollution monitoring stations inside the City of Madrid. The network is made up of 24 automatic stations that collect basic information for atmospheric surveillance.
We apply data cleaning tools similar to the ones before. Note that we create a column for the town code since this is a variable in the dataset with stations outside the City of Madrid (Section 1.4) and we want both datasets to have the same columns to combine them later.
We then extract the longitude and latitude of all stations in the network. We generate a variable called ‘distance_km’ representing the distance from each station to the station in Plaza del Carmen (the only one inside the LEZ and our policy centroid; see Section 2.3). We compute these distances using the Haversine formula and then arrange all stations by increasing distance from the city centre (Plaza del Carmen). We will use this information to create our treatment rings later on.
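The Haversine computation can be sketched in base R as follows. This is a minimal illustration with an assumed Earth radius of 6371 km (which differs slightly from geosphere's default); the analysis itself relies on `distHaversine()` from the geosphere package.

```r
# Great-circle distance between two points via the Haversine formula.
# Assumes a spherical Earth with radius 6371 km (an approximation).
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# One degree of latitude is roughly 111 km
haversine_km(0, 0, 0, 1)
```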
# Open dataset with station features
city_stations <- read_delim("informacion_estaciones_red_calidad_aire.csv", delim=";")
# Rename and select variables
city_stations_clean <- city_stations %>%
# Create column for town_code and transform to factor
mutate(town_code = "079",
town_code = as.factor(str_remove(town_code, "^0+"))) %>%
select(
town_code,
station_code = CODIGO_CORTO,
station_name = ESTACION,
address = DIRECCION,
station_type = NOM_TIPO,
longitude = LONGITUD,
latitude = LATITUD) %>%
mutate(station_code = as.character(station_code))
# Replace "á" with "a"
city_stations_clean <- city_stations_clean %>%
mutate(station_type = gsub("á", "a", station_type),
# If 'station_type' is "Suburbana", replace with "Suburbana (ciudad)" for comparison with out-of-City stations
station_type = ifelse(station_type == "Suburbana", "Suburbana (ciudad)", station_type))
# Extract latitude and longitude of Plaza del Carmen
plaza_carmen_longitude <- city_stations_clean$longitude[city_stations_clean$station_name == "Plaza del Carmen"]
plaza_carmen_latitude <- city_stations_clean$latitude[city_stations_clean$station_name == "Plaza del Carmen"]
# Calculate the distance between each station and Plaza del Carmen
city_stations_clean <- city_stations_clean %>%
# Use `distHaversine()` function to calculate the distance between each station and Plaza del Carmen
mutate(distance_km = round(distHaversine(cbind(longitude, latitude),
c(plaza_carmen_longitude, plaza_carmen_latitude)) / 1000, 2)) %>%
# Arrange the data in ascending order by 'distance_km'
arrange(distance_km)
# Display first few rows of the cleaned dataset
head(city_stations_clean)
## # A tibble: 6 × 8
## town_code station_code station_name address station_type longitude latitude
## <fct> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 79 35 Plaza del Carm… "Plaza… Urbana fondo -3.70 40.4
## 2 79 4 Plaza de España "Plaza… Urbana traf… -3.71 40.4
## 3 79 8 Escuelas Aguir… "Entre… Urbana traf… -3.68 40.4
## 4 79 49 Parque del Ret… "Paseo… Urbana fondo -3.68 40.4
## 5 79 48 Castellana "C/ Jo… Urbana traf… -3.69 40.4
## 6 79 47 Méndez Álvaro "C/ Ju… Urbana fondo -3.69 40.4
## # ℹ 1 more variable: distance_km <dbl>
For each station, we have information on its code (so that station and pollution data can be merged later on), address, location (i.e. longitude and latitude) and distance to the centre. We also see that pollution monitoring stations are of several types:
Urban background, which represent the exposure of the urban population to pollution in general;
Urban traffic, whose pollution measurements are mainly influenced by emissions from nearby streets or highways due to their strategic location;
Suburban, which generally have lower pollution levels (but higher ozone ones) because of their location on the outskirts of the City. We assign these stations the name “Suburbana (ciudad)” to create a distinction with suburban stations outside the City of Madrid.
Along the lines of the last section, we download and read data on pollution monitoring stations outside the Capital from the website of the Community of Madrid. We clean the data using regular expressions to fix the formatting of the longitude, latitude and address columns, which had several issues (e.g. unnecessary quotation marks or commas). We also merge the columns containing information about area type and station type into a single new column, which keeps the latter's name, to ensure comparability with data on stations inside the City of Madrid. We arrange the resulting dataframe so that stations closest to Plaza del Carmen are at the top.
# Open dataset with station features
com_stations <- read_delim("com_stations.csv", delim=";")
# Rename and select variables
com_stations_clean <- com_stations %>%
# Extract the town and station codes from the full station code and adjust their types
mutate(town_code = substr(estacion_codigo, 3, 5),
town_code = as.factor(str_remove(town_code, "^0+")),
station_code = substr(estacion_codigo, 6, 8),
station_code = as.factor(str_remove(station_code, "^0+"))) %>%
select(town_name = estacion_municipio,
town_code,
station_code,
station_type = estacion_tipo_estacion,
area_type = estacion_tipo_area,
rural_area_type = estacion_subarea_rural,
address = estacion_direccion_postal,
longitude = estacion_coord_longitud,
latitude = estacion_coord_latitud)
# Fix the latitude and longitude columns using regular expressions
com_stations_clean <- com_stations_clean %>%
# Remove quotes at the beginning and end of the longitude column
mutate(longitude = str_remove(longitude, "^\"|\",\\s*\"$")) %>%
# Replace commas with periods in the longitude column and convert to numeric
mutate(longitude = as.numeric(gsub(",", ".", longitude))) %>%
# Remove quotes and commas from the end of the latitude column
mutate(latitude = str_remove(latitude, "\"?,?$")) %>%
# Replace commas with periods in the latitude column and convert to numeric
mutate(latitude = as.numeric(gsub(",", ".", latitude)))
# Remove any sequence of two commas AND quotation marks from the address column
com_stations_clean <- com_stations_clean %>%
# Remove quotes from the address column
mutate(address = str_remove(address, '["\']')) %>%
# Remove any sequence of two commas from the address column
mutate(address = str_remove(address, ",{2}"))
# Create a new column that concatenates the 'area_type' and 'station_type' columns
com_stations_clean <- com_stations_clean %>%
# Combine the 'area_type' and 'station_type' columns with a space separator
mutate(station_type_new = paste(area_type, station_type, sep = " "),
# Lowercase the first letter of the second word (e.g. "Urbana Fondo" becomes "Urbana fondo")
station_type_new = sapply(strsplit(as.character(station_type_new), " "),
function(x) paste0(x[1], " ", tolower(substring(x[2], 1, 1)),
substring(x[2], 2), collapse = " "))) %>%
# Remove the 'area_type' and 'station_type' columns and rename the new column as 'station_type'
select(-c(area_type, station_type)) %>%
rename(station_type = station_type_new)
# Calculate the distance between each station and Plaza del Carmen
com_stations_clean <- com_stations_clean %>%
# Use `distHaversine()` function to calculate the distance between each station and Plaza del Carmen
mutate(distance_km = round(distHaversine(cbind(longitude, latitude),
c(plaza_carmen_longitude, plaza_carmen_latitude)) / 1000, 2)) %>%
# Arrange the data in ascending order by 'distance_km'
arrange(distance_km)
# Display first few rows of the cleaned dataset
head(com_stations_clean)
## # A tibble: 6 × 9
## town_name town_code station_code rural_area_type address longitude latitude
## <chr> <fct> <fct> <chr> <chr> <dbl> <dbl>
## 1 Leganes 74 7 No aplica C.E.P.A… -3.75 40.3
## 2 Getafe 65 14 No aplica C.E.I.P… -3.72 40.3
## 3 Coslada 49 3 No aplica Polidep… -3.54 40.4
## 4 Alcorcon 7 4 No aplica Colegio… -3.83 40.3
## 5 Alcobendas 6 4 No aplica Parque … -3.65 40.5
## 6 Majadahonda 80 3 No aplica Campo d… -3.87 40.4
## # ℹ 2 more variables: station_type <chr>, distance_km <dbl>
We also have 24 stations here. After our transformation, we see that there are six station categories combining area type and station type: urban background, urban traffic, suburban background, suburban traffic, urban industrial and rural background.
Complete dataset of data for all stations in the air quality network of the Community of Madrid
After selecting the relevant variables, we combine the data of all 48 stations in the Community of Madrid into a dataframe called ‘data_all_stations’.
# Select key variables for in-City stations
city_selection <- city_stations_clean %>%
select(town_code, station_code, station_type, address, longitude, latitude, distance_km)
# Select key variables for out-of-City stations
com_selection <- com_stations_clean %>%
select(town_code, station_code, station_type, address, longitude, latitude, distance_km)
# Combine both datasets and arrange in ascending order of distance from Plaza del Carmen
data_all_stations <- bind_rows(city_selection, com_selection) %>% arrange(distance_km)
After loading and formatting our pollution and pollution station data, we create a series of visualizations to understand it.
By way of introduction, we might be interested in how average pollution levels (NO2) compare across the different municipalities in the Community of Madrid at a given point in time. We select the year 2022 as it contains the most recent data available for our analysis. We assign labels with the names of the municipalities to the different town codes. We then create our plot by grouping our data by municipality and computing the mean and standard error of the pollution level. We plot a point for each municipality with error bars indicating a 95% confidence interval.
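The interval construction used in the plot is the usual normal approximation, mean ± 1.96 standard errors, with the standard error computed as sd/√n. A minimal base-R sketch with made-up values:

```r
# 95% confidence interval for a mean (normal approximation, toy data)
x  <- c(30, 35, 28, 40, 33)       # hypothetical monthly NO2 values
m  <- mean(x)
se <- sd(x) / sqrt(length(x))     # standard error of the mean
c(lower = m - 1.96 * se, mean = m, upper = m + 1.96 * se)
```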
# Assign labels to levels of 'town_code'
my_levels <- c(5, 6, 7, 9, 13, 14, 16, 45, 47, 49, 58, 65, 67, 74, 79, 80, 92, 102, 120, 123, 133, 148, 161, 171, 180)
my_labels <- c("Alcala de Henares", "Alcobendas", "Alcorcon", "Algete", "Aranjuez", "Arganda del Rey",
"El Atazar", "Colmenar Viejo", "Collado Villalba", "Coslada", "Fuenlabrada", "Getafe",
"Guadalix de la Sierra", "Leganes", "Madrid", "Majadahonda", "Mostoles",
"Orusco de Tajuna", "Puerto de Cotos", "Rivas-Vaciamadrid", "San Martin de Valdeiglesias",
"Torrejon de Ardoz", "Valdemoro", "Villa del Prado", "Villarejo de Salvanes")
# Select necessary variables for the plot
full_data_plot <- full_data %>%
mutate(town_code_label = factor(town_code, levels = my_levels, labels = my_labels))
# Confidence interval plot for all stations in the Community of Madrid
full_data_plot %>%
filter(year == 2022) %>%
group_by(town_code_label) %>%
summarise(pollution_level_mean = mean(pollution_level),
pollution_level_se = sd(pollution_level)/sqrt(n())) %>%
ggplot(aes(reorder(town_code_label, pollution_level_mean), pollution_level_mean)) +
geom_point() +
geom_errorbar(aes(ymin = pollution_level_mean - 1.96 * pollution_level_se,
ymax = pollution_level_mean + 1.96 * pollution_level_se),
width = 0.4) +
labs(x = "Municipality Monitoring Station",
y = expression("Average Monthly NO"[2] ~ "(95% Confidence Interval)"),
title = "Air Quality by Municipality in the Community of Madrid (2022)") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 20)) +
ylim(0, 40) +
theme_light() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1, size = 7),
axis.title.x = element_text(margin = margin(t = 10), size = 10),
axis.title.y = element_text(size = 10),
plot.margin = margin(b = 20))
There are significant differences between pollution levels in large municipalities close to Madrid (like Leganés or Getafe) and those in smaller, rural areas (e.g. Puerto de Cotos). We will see later that this divergence makes rural pollution stations inadequate as control units in our analysis.
From the previous chart, we hypothesize that there may be an important difference between pollution levels in the City of Madrid (due to larger traffic flows) and those outside (despite the heterogeneity among the latter as seen in the previous plot). Hence, we will now compare both average pollution levels and the number of days with excessive air pollution levels by year and location.
To check how average pollution levels have evolved between 2015 and 2022 inside vis-à-vis outside the City of Madrid, we aggregate both sets of data (in-City vs. out-of-City) by year and compute the mean. We then combine the data and pivot it to wide format.
# Aggregate hourly data from inside the City of Madrid to yearly level and create 'Location' column
city_yearly <- all_city_data_clean %>%
group_by(year) %>%
summarise(avg_pollution_level = round(mean(pollution_level), 2),
# create 'Location' column to identify values for inside the City
Location = "Inside City of Madrid")
# Aggregate hourly data from outside the City of Madrid to yearly level and create 'Location' column
com_yearly <- all_com_data_clean %>%
group_by(year) %>%
summarise(avg_pollution_level = round(mean(pollution_level), 2),
# create 'Location' column to identify values for outside the City
Location = "Outside City of Madrid")
# Combine city and out-of-City data
summary_table_1 <- bind_rows(city_yearly, com_yearly)
# Convert the summary table to wide format (years as columns and areas as rows)
summary_table_wide_1 <- pivot_wider(summary_table_1, names_from = "year",
values_from = "avg_pollution_level",
values_fn = mean)
# Format and print the summary table
kable(summary_table_wide_1, format = "html", row.names = FALSE) %>%
kable_styling(c("striped", "bordered"), full_width = FALSE)
| Location | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 |
|---|---|---|---|---|---|---|---|---|
| Inside City of Madrid | 40.90 | 38.51 | 41.54 | 36.54 | 34.50 | 27.21 | 28.94 | 28.22 |
| Outside City of Madrid | 25.73 | 23.79 | 26.44 | 22.10 | 21.12 | 16.97 | 17.50 | 17.45 |
We see that 2017 registered the highest average pollution levels both inside and outside the City of Madrid. In fact, in 2015 and 2017 the Capital exceeded the limit of 40 µg/m3 for yearly average NO2 concentrations set by the Ambient Air Quality Directive. Moreover, throughout all years under consideration, NO2 levels are considerably higher in the Capital than in the rest of the Community.
The limit of 40 µg/m3 applies on a yearly basis and, as we just saw, was violated in two years in the Capital. However, we might also want to dig a bit deeper and check whether the hourly limit of 200 µg/m3 (which, according to the Directive, shall not be exceeded more than 18 times in a calendar year) was complied with.
We thus calculate, for each location, the number of hours each day on which the pollution level exceeded 200 µg/m3. We then combine the data from both datasets and create a summary table showing, for each year and location, the number of days on which the pollution level exceeded 200 µg/m3 at least once. We then pivot and display the data below.
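The counting logic can be illustrated in base R on a toy series (the values below are made up; the actual computation uses the full hourly datasets):

```r
# Toy example: two days of hourly NO2 readings (hypothetical values)
day <- c(1, 1, 1, 2, 2, 2)
no2 <- c(150, 210, 205, 180, 190, 160)

# Hours above 200 per day, then number of days with at least one exceedance
hours_over <- tapply(no2 > 200, day, sum)
sum(hours_over > 0)
```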
# Create daily summaries of pollution level for stations both inside and outside the city
city_daily <- all_city_data_clean %>%
group_by(year, month, day) %>%
summarise(pollution_count = sum(pollution_level > 200))
com_daily <- all_com_data_clean %>%
group_by(year, month, day) %>%
summarise(pollution_count = sum(pollution_level > 200))
# Combine city and out-of-City data
all_daily <- bind_rows(
city = city_daily %>% mutate(Location = "Inside the City"),
com = com_daily %>% mutate(Location = "Outside the City"))
# Count number of times average hourly pollution levels exceeded 200 µg/m3 for each year and area
summary_table_2 <- all_daily %>%
group_by(Location, year) %>%
summarise(num_times = sum(pollution_count > 0))
# Convert the summary table to wide format
summary_table_wide_2 <- summary_table_2 %>%
pivot_wider(names_from = "year", values_from = "num_times")
# Format and print the summary table
kable(summary_table_wide_2, format = "html", row.names = FALSE) %>%
kable_styling(c("striped", "bordered"), full_width = FALSE)
| Location | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 |
|---|---|---|---|---|---|---|---|---|
| Inside the City | 61 | 47 | 63 | 38 | 43 | 13 | 15 | 8 |
| Outside the City | 28 | 5 | 12 | 0 | 1 | 0 | 0 | 0 |
Only from 2020 onwards did the number of days per year with hourly pollution levels above 200 µg/m3 in Madrid fall below 18. Outside the City of Madrid, only 2015 witnessed a violation of the Directive's rule for hourly NO2 concentrations.
Clearly, the LEZ in Madrid is meant to target the comparatively higher levels of pollution in the Capital. Hence, we would like to map the distribution of pollution stations in Madrid in relation to the LEZ perimeter (and the broader area of the City).
We convert our dataframe containing information on pollution stations to a spatial object using the longitude and latitude columns to define the coordinates and specifying the coordinate reference system (CRS) as EPSG:4326 (which represents WGS 84, a commonly used global CRS for geographic data).
We then load the LEZ and administrative boundary data we obtained from the geoportal of the Madrid City Council and transform the former to the same CRS as the station data. We do this to ensure both objects align for spatial analysis. We then spatially join the station and LEZ data, resulting in an object called ‘stations_in_LEZ’ that contains information about which stations are located within the restricted area. We determine whether a station is located inside or outside the Madrid Central (MC) area by its exact coordinates. Hence, we create a binary variable called ‘LEZ’ that is set to 1 for stations within the LEZ and 0 for stations outside.
First, we display the resulting stations and the LEZ perimeter using the mapview() function. We then use the same function to create a second map that also incorporates the administrative boundaries of the City of Madrid, so the reader can compare the relative size of the restricted area.
# Convert station data to spatial object
city_stations_coord.spdf <- st_as_sf(city_stations_clean,
coords = c("longitude", "latitude"),
crs = 4326)
# Load and transform LEZ data to match station coordinates (same CRS)
LEZ_limits <- st_read("ZBEDEP_Distrito_Centro_Limite.shp")
## Reading layer `ZBEDEP_Distrito_Centro_Limite' from data source
## `C:\Users\elena\OneDrive\Desktop\MUCSS\V. TFM\master-thesis-elena-yustres\ZBEDEP_Distrito_Centro_Limite.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 439078.8 ymin: 4472973 xmax: 441372.8 ymax: 4475758
## Projected CRS: ETRS89 / UTM zone 30N
LEZ_limits_t <- st_transform(LEZ_limits, st_crs(city_stations_coord.spdf))
# Spatially join stations and LEZ data
stations_in_LEZ <- st_join(city_stations_coord.spdf, LEZ_limits_t, join = st_within)
# Create binary variable to denote if station is within the LEZ
stations_in_LEZ$LEZ <-
ifelse(is.na(stations_in_LEZ$AMBITO_GEO), 0, 1)
# Filter for the only station inside the LEZ
stations_in_LEZ %>% filter(LEZ == 1)
## Simple feature collection with 1 feature and 8 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -3.703166 ymin: 40.41921 xmax: -3.703166 ymax: 40.41921
## Geodetic CRS: WGS 84
## # A tibble: 1 × 9
## town_code station_code station_name address station_type distance_km
## * <fct> <chr> <chr> <chr> <chr> <dbl>
## 1 79 35 Plaza del Carmen Plaza del Ca… Urbana fondo 0
## # ℹ 3 more variables: geometry <POINT [°]>, AMBITO_GEO <chr>, LEZ <dbl>
# Display the stations on the map with the LEZ perimeter
mapview(stations_in_LEZ, label = "CENTRO", col.regions = "orange") + mapview(LEZ_limits_t, alpha.regions = 0.2)
# Load object with administrative boundaries of Madrid
madrid <- st_read("madrid_.geojson")
## Reading layer `madrid_' from data source
## `C:\Users\elena\OneDrive\Desktop\MUCSS\V. TFM\master-thesis-elena-yustres\madrid_.geojson'
## using driver `GeoJSON'
## Simple feature collection with 128 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -3.887647 ymin: 40.31325 xmax: -3.516929 ymax: 40.64444
## Geodetic CRS: WGS 84
# Display the stations on the map with LEZ perimeter and Madrid administrative boundaries
mapview(stations_in_LEZ, label = "CENTRO", col.regions = "orange") + mapview(madrid, alpha.regions = 0.1) + mapview(LEZ_limits_t, alpha.regions = 0.5)
The reader is encouraged to hover over both maps and click on any given station to find information about its name, type and address, among other details.
From the dataframe ‘stations_in_LEZ’ and both plots we can gather that only the station in Plaza del Carmen is inside the LEZ. This implies we can treat it as our “policy centroid”. Also, the LEZ is relatively small compared to the area of the entire municipality.
With the goal of quantifying the spillover effects of the policy beyond the borders of the LEZ, we define a set of rings which segment all 48 stations in the Community of Madrid network according to their distance from the Madrid city centre.
We define arbitrary cutoff points at 2.5, 8, 15, 20 and 40 kilometres from Plaza del Carmen, thereby yielding six different rings or categories of stations. The six rings have 4, 12, 14, 4, 6 and 8 stations, respectively.
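One detail worth noting about the ring assignment below: cut() uses right-closed intervals by default, so a station at exactly 2.5 km from Plaza del Carmen falls into the first ring. A minimal sketch with hypothetical distances:

```r
# cut() with default right = TRUE: boundary values belong to the inner interval
d <- c(0, 2.5, 2.51, 40, 40.01)  # hypothetical distances in km
ring <- cut(d, breaks = c(-Inf, 2.5, 8, 15, 20, 40, Inf),
            labels = paste("Ring", 1:6))
as.character(ring)
```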
# Specify cutoffs for 'distance_km' and assign ring labels
data_all_stations$ring <- cut(data_all_stations$distance_km, breaks = c(-Inf, 2.5, 8, 15, 20, 40, Inf),
labels = c("Ring 1: Treated",
"Ring 2: Treated",
"Ring 3: Treated",
"Ring 4: Treated",
"Ring 5: Control",
"Ring 6: Excluded"))
# Display number of stations in each ring
data_all_stations %>% group_by(ring) %>% count()
## # A tibble: 6 × 2
## # Groups: ring [6]
## ring n
## <fct> <int>
## 1 Ring 1: Treated 4
## 2 Ring 2: Treated 12
## 3 Ring 3: Treated 14
## 4 Ring 4: Treated 4
## 5 Ring 5: Control 6
## 6 Ring 6: Excluded 8
Next, we set a bounding box that encompasses the Community of Madrid by defining its minimum and maximum longitude and latitude coordinates. We then retrieve the map from the Stamen Maps API using get_stamenmap(). The map is plotted using ggmap(), which allows us to add the map tiles to a ‘ggplot2’ object. The ‘extent’ argument is set to “normal” to ensure that the aspect ratio of the map is preserved, and ‘maprange’ is set to “FALSE” to prevent ggmap() from adjusting the axes of the plot. Each station is a point on the map and its corresponding ring is coded through the color aesthetic.
# Set bounding box coordinates
bbox_madrid <- c(-4.647491, 39.804255 , -2.942500, 41.221046)
# Retrieve map using `get_stamenmap()`
madrid_map <- get_stamenmap(bbox = bbox_madrid, zoom = 10, maptype = "toner-lite")
# Define colors for each ring
ring_colors <- c("#23541e", "#398f33", "#70cc3e", "#a9ff4d", "#1AA7EC", "#b30000")
# Create plot
ggmap(madrid_map, extent = "normal", maprange = FALSE, alpha = 0.3) +
geom_point(data = data_all_stations,
aes(x = longitude, y = latitude, color = ring),
size = 1.5) +
scale_color_manual(values = setNames(ring_colors, levels(as.factor(data_all_stations$ring)))) +
labs(title = "Location of Pollution Monitoring Stations in the Community of Madrid: Specification by Rings",
color = "Ring") +
theme_void() +
theme(legend.position = "bottom",
legend.background = element_rect(colour = "black", linewidth = 0.3),
legend.key.size = unit(0.5, "cm"),
legend.text = element_text(size = 7),
legend.title = element_text(size = 8),
legend.box.spacing = unit(0.1, "cm"),
legend.box.margin = margin(0, 0, 0, 0),
legend.box.background = element_rect(color = "black", linewidth = 0.2),
plot.title = element_text(size = 12, hjust = 0.5))
To close off the descriptive analysis on pollution-related data, we want to know if there are any differences in the evolution of NO2 pollution levels (2015 - 2022) between urban and suburban stations (see description in Section 1.3).
We start by merging monthly pollution data for the city of Madrid with information about those same stations on the one hand, and pollution data for stations outside the city of Madrid with information about those other stations on the other. We add a variable called ‘type’ to both datasets to indicate whether a given station is urban or suburban.
Importantly, we remove rural (background) stations from the dataset with out-of-Capital stations because, as we saw before, these stations are intrinsically different from those in urban settings even outside the city. They are located in places with very little traffic and thus, when put together with other stations outside the City of Madrid, differences in pollution levels with City stations are falsely amplified.
We combine our dataframes and replace NAs in the ‘town_name’ column by ‘Madrid’ since the ‘city_stations_clean’ dataframe lacks this column in contrast with ‘com_stations_clean’. We only want observations for January and July for each year since we are also interested in analyzing whether there are significant differences between different months in the year.
Finally, we create the column ‘type_color’ to categorize our observations into four different labels based on the station type and location (in- vs. out-of-City). We calculate monthly averages for each category by grouping the data by ‘year’, ‘month’, ‘type’ and ‘type_color’ and plot them using the color and line type aesthetics.
# Join 'city_monthly' with 'city_stations' by 'town_code' and 'station_code'
city_data <- city_monthly %>%
left_join(city_stations_clean, by = c("town_code", "station_code"))
# Create a new column ('type') to indicate if the station is urban or suburban (in-City data)
city_data$type <- if_else(city_data$station_type == "Urbana trafico" | city_data$station_type == "Urbana fondo",
"Urban", "Suburban")
# Join 'com_monthly' with 'com_stations' by 'town_code' and 'station_code'
com_data <- com_monthly %>%
left_join(com_stations_clean, by = c("town_code", "station_code")) %>%
# Remove 'rural stations'
filter(station_type != "Rural fondo")
# Create a new column ('type') to indicate if the station is urban or suburban (out-of-City data)
com_data$type <- ifelse(grepl("^Urbana", com_data$station_type), "Urban", "Suburban")
# Combine both dataframes into a single one for the whole Community
full_data_complete <- bind_rows(city_data, com_data) %>%
select(-c(station_name, rural_area_type))
# Replace the missing values in 'town_name' by Madrid
full_data_complete$town_name <- ifelse(is.na(full_data_complete$town_name), "Madrid", full_data_complete$town_name)
# Filter to include only January and July of each year
full_data_plot <- full_data_complete %>%
filter(month %in% c(1, 7)) %>%
# Create column with 4 different labels with combinations of type and location
mutate(type_color = if_else(town_code == 79 & type == "Urban", "City_Urban",
if_else(town_code == 79 & type == "Suburban", "City_Suburban",
if_else(town_code != 79 & type == "Urban", "NonCity_Urban", "NonCity_Suburban")))) %>%
# Obtain monthly averages for each category
group_by(year, month, type, type_color) %>%
summarise(pollution_level = mean(pollution_level))
# Plot the data
full_data_plot %>%
ggplot(aes(x = as.Date(paste(year, month, "01", sep = "-")),
y = pollution_level,
color = ifelse(grepl("_Urban", type_color), "Urban Station", "Suburban Station"),
linetype = ifelse(grepl("^City", type_color), "Inside City", "Outside City"))) +
geom_line(linewidth = 0.4) +
labs(x = "Year",
y = expression("Monthly NO"[2] ~ "Pollution Averages"),
linetype = "Station Location",
color = "Station Type",
title = expression("Evolution of NO"[2] ~ "Levels in the Community of Madrid by Location and Station Type (2015 - 2022)")) +
scale_linetype_manual(values = c("solid", "dashed")) +
scale_color_manual(values = c("darkred", "darkblue")) +
theme_light() +
theme(plot.title = element_text(size = 11))
First, there does seem to be a clear difference between urban and suburban stations, both when considering stations inside and outside the City of Madrid. Urban stations (blue) appear to always have higher average pollution levels than suburban ones (red), regardless of whether the station is inside or outside the Capital.
Second, we see that average pollution is higher in January than in July for every year, type of station and location. This could be due to increased traffic during holiday season compared to the summer months, the increased use of heating due to lower temperatures or meteorological conditions themselves (such as low wind speeds). These differences highlight the need for incorporating weather-related controls (see Section 4 below) and month fixed effects (see Section 5) into our model.
Let’s take a step back. We are trying to assess how the Madrid LEZ affects the pollution levels outside the perimeter of the area under restrictions. We make use of the following Directed Acyclic Graph (DAG) to visually display our identification strategy aimed at isolating and quantifying this effect (Pearl 2009).
We use the dagify() function from the ggdag
package. Each node represents a variable in the causal model and the
edges represent the causal relationships between them. We specify the
treatment as the Madrid LEZ and the outcome
variable as pollution levels outside the LEZ. We then define
three possible mechanisms or ways by which our treatment may affect our
outcome, giving rise to the following mediators:
1. Redirection of traffic from areas subject to restrictions to areas without restrictions, thereby increasing pollution outside the LEZ
2. Increased use of public transport to avoid restrictions on private vehicles, thereby decreasing pollution outside the LEZ (as well as inside)
3. Replacement of highly polluting vehicles with “greener” ones by drivers to avoid fines, thereby decreasing pollution outside the LEZ (as well as inside)
Finally, we theorize that there may be some potential confounders which could bias our estimate by causing a change in pollution levels not attributable to the LEZ. First, macroeconomic conditions may affect the use of private vehicles (and hence of public transportation by the substitution effect) and the willingness or ability to purchase new ones, thus affecting mediators (2) and (3). Second, large public works in the city centre could have an important effect on channels (1) and (2) by “forcing” people to take alternative routes to avoid traffic jams or by encouraging them to use the subway, respectively (see Section 6.1.F). Third and last, meteorological conditions may affect NO2 concentrations independently from the existence of restrictions, thus posing a threat to identification.
We will see that the impact of both public works and economic indicators is accounted for by time and unit fixed effects in our model. In contrast, weather effects may escape the control of the intersection of these two kinds of fixed effects and we thus need to incorporate them explicitly into our model.
We manually set the color of each node according to its type
(outcome, treatment, mediator or confounder) using the
scale_color_manual() function.
# Set the theme of the plot to a DAG theme
theme_set(theme_dag())
# Define the coordinates of the nodes in the DAG
coord_dag <- list(
x = c("fleet_recomposition" = 1.5, "public_transport" = 1.5, "redirected_traffic" = 1.5,
"MC" = 0, "pollution_outside" = 3,
"weather" = 3.5, "economic" = 2, "public_works" = 2),
y = c("fleet_recomposition" = 1, "public_transport" = 2, "redirected_traffic" = 3,
"MC" = 2, "pollution_outside" = 2,
"weather" = 0.5, "economic" = 0.5, "public_works" = 4))
# Define the DAG structure and label the nodes
pollution_dag <- dagify(pollution_outside ~ fleet_recomposition + public_transport + redirected_traffic,
pollution_outside ~ weather,
redirected_traffic ~ MC,
redirected_traffic ~ public_works,
fleet_recomposition ~ MC,
public_transport ~ public_works,
fleet_recomposition ~ economic,
public_transport ~ MC,
public_transport ~ economic,
labels = c(
"pollution_outside" = "POLLUTION\n OUTSIDE LEZ",
"redirected_traffic" = "(1) Redirection\n of traffic",
"public_transport" = "(2) Higher relative use\n of public transport",
"fleet_recomposition" = "(3) Fleet\n recomposition",
"MC" = "LEZ",
"weather" = "Weather",
"public_works" = "Public works",
"economic" = "Economic\n conditions"
),
coords = coord_dag,
outcome = "pollution_outside")
# Plot the DAG without text labels and with color-coding based on node type
p <- ggdag(pollution_dag, text = FALSE, use_labels = "label")
p$layers[[3]]$mapping <- aes(colour = ifelse(
name == "pollution_outside", "Outcome",
ifelse(
name == "MC", "Treatment",
ifelse(
name %in% c("fleet_recomposition", "public_transport", "redirected_traffic"),
"Mediators",
"Confounders"))))
# Add color legend
p + scale_color_manual(values = c("gray", "skyblue", "maroon", "darkgreen")) +
# Remove legend and adjust plot margins
theme(legend.position = "none") +
# Add plot title
labs(title = "DAG: Causal Effect of Madrid LEZ on Neighbouring Areas")
Taking up the last point in the previous section, we want to include weather data to control for time-variant station heterogeneity (more on this in the main document).
We access weather data provided by the Spanish Meteorological Agency (Agencia Estatal de Meteorología, AEMET) through the AEMETOpenData REST API. Note that, in general, REST APIs use HTTP requests to communicate data and functionality between client and server applications. Our API specifically grants us access to data on ozone contents, satellite information, climate predictions or daily as well as monthly observations for a series of meteorological variables.
We use the function aemet_api_key() to set the key for
the AEMET API, which was obtained per request and generated from the
user’s email address.
# Access AEMET API (personal key redacted; keys can be requested free of charge from the AEMET OpenData portal)
aemet_api_key("<YOUR_AEMET_API_KEY>", overwrite = TRUE)
Before working with data on the meteorological variables themselves, we access data on the 13 weather stations scattered around the Community of Madrid.
After cleaning our data, we want to match each pollution monitoring station with its closest weather station. To that end, we extract a matrix of coordinates for all air pollution monitoring stations and weather stations in the Community from the ‘data_all_stations’ and the newly created ‘weather_stations’ dataset, respectively. We then calculate the distances between pollution and weather stations using (again) the Haversine formula and match each air pollution station to the closest weather station. We add this information as a new column to the ‘data_all_stations’ dataframe we created earlier.
# Access data on weather stations and make transformations
weather_stations <- aemet_stations(verbose = FALSE, return_sf = FALSE) %>%
filter(provincia == "MADRID") %>%
select(-c(indicativo, indsinop, provincia)) %>%
rename(altitude = altitud,
longitude = longitud,
latitude = latitud,
weather_station_name = nombre)
# Create a matrix of coordinates for air pollution stations
coords_pollution <- data_all_stations[, c("longitude", "latitude")]
# Create a matrix of coordinates for weather stations
coords_weather <- weather_stations[, c("longitude", "latitude")]
# Calculate distances between each air pollution station and each weather station
distances <- distm(coords_pollution, coords_weather, fun = distHaversine)
# Find the index of the closest weather station for each air pollution station
closest_weather_index <- apply(distances, 1, which.min)
# Match each air pollution station to the closest weather station
data_all_stations$closest_weather_station <- weather_stations$weather_station_name[closest_weather_index]
While we obtained data for 2022 to show the evolution of pollution levels using the most recent observations, pollution data after June 2021 will not be used in our model for reasons outlined later and in the main document. Hence, we will only access weather data from 2015 until the end of 2021 to keep all observations for all seven years for visualization purposes.
We now use aemet_daily_clim() to obtain daily data for
all stations in Madrid. We clean the data by selecting the relevant
meteorological variables as found in the literature (see Section 4 in
the main document): average, minimum and maximum temperature (we will
later select the most relevant out of these three), total rain,
direction of maximum wind speed, average wind speed and minimum and
maximum pressure (we will later compute the average to avoid
collinearity). We then adjust variable types and acknowledge as missing
values all observations containing the value ‘88’ for the wind direction
variable following the information available in the metadata.
# Obtain daily weather values for all stations in Madrid
weather_full <- aemet_daily_clim(station = "all", start = "2015-01-01", end = "2021-12-31",
verbose = FALSE, return_sf = FALSE) %>%
filter(provincia == "MADRID")
# Make transformations: select, adjust variable types and create new columns for the date
weather_full_final <- weather_full %>%
select(date = fecha,
weather_station_name = nombre,
avg_temperature = tmed,
min_temperature = tmin,
max_temperature = tmax,
total_rain = prec,
direction_max_wind_speed = dir,
avg_wind_speed = velmedia,
min_pressure = presMin,
max_pressure = presMax) %>%
# Replace comma with point
mutate(total_rain = as.numeric(gsub(",", ".", total_rain)),
direction_max_wind_speed = as.numeric(direction_max_wind_speed),
# Obtain 'year' and 'month' columns from 'date' column
date = ymd(date),
year = year(date),
month = month(date),
day = day(date)) %>%
select(-date)
# Recode all values of 88 in 'wind direction' as NA
weather_full_final$direction_max_wind_speed <- ifelse(weather_full_final$direction_max_wind_speed == 88, NA,
weather_full_final$direction_max_wind_speed)
At first glance, we notice that we face the problem of missing values in our weather variables. We will use multiple imputation to tackle this issue. This method uses the distribution of the observed data to estimate multiple plausible values that reflect the uncertainty about the missing observations (Rubin 1976). While it generally performs better than zero- or mean-/median- imputation and is most likely superior to dropping observations altogether, multiple imputation relies on a few assumptions we must check first.
Let’s start by checking the percentage of missing values in each of our potential controls:
# Create a vector of column names to exclude
exclude_cols <- c("year", "month", "day", "weather_station_name")
# Calculate the percentage of missing values for each numeric variable in the dataframe
NA_percentage <- sapply(weather_full_final[ , !(colnames(weather_full_final) %in% exclude_cols)],
function(x) mean(is.na(x))) * 100
# Display the result
print(NA_percentage)
## avg_temperature min_temperature max_temperature
## 4.382683 4.382683 4.376395
## total_rain direction_max_wind_speed avg_wind_speed
## 5.854057 12.453234 10.491401
## min_pressure max_pressure
## 20.800453 20.803597
It seems that the temperature and rain variables do not pose a big problem since they only have around 5% of missing values. Wind-related variables have around 10% missing values, which is worse but still within a “normal range.” The issue is not as clear with the pressure-related variables, for which around a fifth of observations are missing.
Hence, we do some checks taking minimum pressure as our example precisely because of its relatively higher number of missing values (together with maximum pressure). Before that, however, we print its five-number summary in order to later compare these values in the variable before and after imputation.
# Print 5-number summary for 'min_pressure'
summary(weather_full_final$min_pressure)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 780.4 890.0 931.0 908.1 942.9 971.9 6616
From a theoretical perspective, multiple imputation relies on the assumption of Missing at Random (MAR). This states that the probability of a data point being missing depends only on the observed data and not on the missing values themselves. While this assumption cannot be tested statistically, we can check whether missingness is associated with the station by plotting the frequency of NAs in minimum pressure by the weather station:
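As a toy illustration of the MAR mechanism (simulated data, not the thesis data): when missingness depends only on an observed covariate, complete-case summaries become biased, which is precisely what imputation conditioned on the observed data can correct.

```r
set.seed(42)
n <- 1000
x <- rnorm(n)                       # fully observed covariate
y <- x + rnorm(n)                   # target variable

# Higher x -> higher probability that y is missing (depends on x only: MAR)
miss <- runif(n) < plogis(x)
y_obs <- ifelse(miss, NA, y)

# Complete-case mean is biased downwards, because high-x (and hence
# high-y) rows are the ones most likely to drop out
mean(y)
mean(y_obs, na.rm = TRUE)
```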
# Compute number of NAs for 'min_pressure' in each station
na_counts <- weather_full_final %>%
group_by(weather_station_name) %>%
summarise(na_count = sum(is.na(min_pressure)))
# Plot histogram of missing values per station
ggplot(na_counts, aes(x = reorder(weather_station_name, -na_count), y = na_count)) +
geom_bar(stat = "identity", fill = "lightyellow", color = "black") +
labs(x = "Weather Station",
y = "Number of Missing Values",
title = "Missing Values in Minimum Pressure by Station") +
theme_light() +
theme(axis.text.x = element_text(angle = 70, hjust=1, size = 7))
We see that the stations in Aranjuez, Ciudad Universitaria and Somosierra have a considerably higher number of missing values than other stations. This suggests that ‘min_pressure’ (and the other variables for which we repeated this process) might not be missing at random.
Nevertheless, there are two main reasons why we will proceed with the multiple imputation of our weather covariates.
First, since we include station fixed effects in our models (Section 5), we can think of missingness as an additional attribute that varies across stations besides, for instance, the amount of traffic. Hence, assuming missingness is captured by these unit fixed effects, we want to ensure that we do not observe more missing values at specific dates (which would be problematic). To check this, we create a similar plot but this time we group by date rather than station:
# Compute number of NAs for 'min_pressure' for each date
na_counts <- weather_full_final %>%
group_by(year, month) %>%
summarise(na_count = sum(is.na(min_pressure))) %>%
mutate(date = as.yearmon(paste(year, month, sep = "-"), format = "%Y-%m")) # as.yearmon() comes from the 'zoo' package
# Define the levels of the date column in chronological order
na_counts$date <- factor(na_counts$date, levels = unique(na_counts$date))
# Filter the data to select a subset of labels
selected_labels <- na_counts$date[seq(1, nrow(na_counts), length.out = 20)]
# Plot histogram of missing values per date
ggplot(na_counts, aes(x = date, y = na_count)) +
geom_bar(stat = "identity", fill = "lightyellow", color = "black") +
labs(x = "Date",
y = "Number of Missing Values",
title = "Missing Values in Minimum Pressure by Date") +
theme_light() +
theme(axis.text.x = element_text(angle = 70, hjust = 1, size = 7)) +
scale_x_discrete(breaks = selected_labels)
While March of 2018 seems to have an unusually high number of missing values, there does not seem to be any pattern indicating missingness is strongly correlated with time. We hence have a more or less balanced panel.
Second, we see that the imputed distributions of ‘min_pressure’ are
rather similar to the original one in the plots below. In order to
visualize how the imputed values are distributed for this specific
variable, we use the package mice()
(standing for ‘Multivariate Imputation by Chained Equations’) with three
different algorithms: Predictive Mean Matching (“pmm”), Lasso Regression
(“lasso.norm”) and Classification and Regression Trees (“cart”).
We use ten imputations for each missing value (m=10) and set a random seed to ensure the replicability of our imputation results.
set.seed(1234)
# Create dataframe with original and imputed values using different algorithms
mice_imputed <- data.frame(
original = weather_full_final$min_pressure,
imputed_pmm = complete(mice(weather_full_final, m=10, method = "pmm"))$min_pressure,
imputed_cart = complete(mice(weather_full_final, m=10, method = "cart"))$min_pressure,
imputed_lasso = complete(mice(weather_full_final, m=10, method = "lasso.norm"))$min_pressure)
We plot the resulting distributions below:
# Create histograms for the distribution of each algorithm and compare with original one
MICE_plot1 <- ggplot(mice_imputed, aes(x = original)) +
geom_histogram(binwidth = 4, color = "black", fill = "lightblue") +
labs(x= "Values",
y = "Frequency",
title = "Original Distribution of Minimum Pressure") +
theme_light() +
theme(plot.title = element_text(size = 9))
MICE_plot2 <- ggplot(mice_imputed, aes(x = imputed_pmm)) +
geom_histogram(binwidth = 4, color = "black", fill = "lightpink", position = "identity") +
labs(x= "Values",
y = "Frequency",
title = "PMM-Imputed Distribution of Minimum Pressure") +
theme_light() +
theme(plot.title = element_text(size = 9))
MICE_plot3 <- ggplot(mice_imputed, aes(x = imputed_cart)) +
geom_histogram(binwidth = 4, color = "black", fill = "lightgreen", position = "identity") +
labs(x= "Values",
y = "Frequency",
title = "Cart-Imputed Distribution of Minimum Pressure") +
theme_light() +
theme(plot.title = element_text(size = 9))
MICE_plot4 <- ggplot(mice_imputed, aes(x = imputed_lasso)) +
geom_histogram(binwidth = 4, color = "black", fill = "lightsalmon", position = "identity") +
labs(x= "Values",
y = "Frequency",
title = "Lasso-Imputed Distribution of Minimum Pressure") +
theme_light() +
theme(plot.title = element_text(size = 9))
plot_grid(MICE_plot1, MICE_plot2, MICE_plot3, MICE_plot4, nrow = 2, ncol = 2)
Although all algorithms yield approximately equivalent distributions, PMM seems to result in the closest to the original distribution. Due to this and the relatively lower required computational time, we choose PMM to impute all our weather variables.
Note that PMM is applied within a chained-equations approach: in each iteration, every variable with missing values is regressed on all other variables in the dataset (both observed and previously imputed). Each missing entry is then replaced by the observed value of a “donor” case whose predicted value is closest to the predicted value for the missing case.
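As a stylized, single-variable sketch of this donor idea (the real PMM in `mice` samples from a small pool of nearest donors and iterates across all incomplete variables; the names below are illustrative):

```r
set.seed(1)
x <- rnorm(100)                          # fully observed predictor
y <- 2 * x + rnorm(100)                  # target variable
y[sample(100, 20)] <- NA                 # introduce missing entries

obs <- !is.na(y)
fit <- lm(y ~ x, subset = obs)           # regression on observed cases
pred_all <- predict(fit, newdata = data.frame(x = x))

# For each missing case, find the observed case whose *predicted* value is
# closest, and borrow its *observed* y (the "donor")
for (i in which(!obs)) {
  donor <- which.min(abs(pred_all[obs] - pred_all[i]))
  y[i] <- y[obs][donor]
}
```

Because imputed values are always borrowed from observed cases, PMM never produces implausible values outside the support of the data, one reason it performs well for bounded variables such as pressure.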
set.seed(1234)
# Specify columns to impute
cols_to_impute <- c("avg_temperature", "max_temperature", "min_temperature",
"total_rain", "direction_max_wind_speed", "avg_wind_speed",
"min_pressure", "max_pressure")
# Impute missing values in the columns using PMM
imputed_data <- mice(weather_full_final[cols_to_impute], m = 10, method = "pmm")
# Update original dataframe with imputed values
weather_full_final[cols_to_impute] <- complete(imputed_data)[, cols_to_impute]
Going beyond the visual inspection of the imputed distributions above, we see that the five-number summary of the imputed minimum pressure variable is very similar to that of the original variable.
# Print 5-number summary for imputed variable
summary(weather_full_final$min_pressure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 780.4 893.0 932.0 909.3 943.1 971.9
Hence, imputation seems to work well with this and the rest of our eight weather variables. After imputing them, we aggregate observations by month. We extract the mean for the temperature- and wind-related variables, while we compute the sum for the rain covariate following common practice. Finally, we combine the information of the ‘min_pressure’ and ‘max_pressure’ variables into a single one (‘avg_pressure’) by averaging them out.
# Group imputed weather variables and compute monthly averages
weather_full_final_a <- weather_full_final %>%
group_by(weather_station_name, year, month) %>%
summarise(avg_temperature_monthly = round(mean(avg_temperature), 2),
min_temperature_monthly = round(min(min_temperature), 2),
max_temperature_monthly = round(max(max_temperature), 2),
total_rain_monthly = round(sum(total_rain), 2),
avg_wind_speed_monthly = round(mean(avg_wind_speed), 2),
wind_direction_monthly = round(mean(direction_max_wind_speed), 2),
avg_pressure_monthly = round(mean((min_pressure + max_pressure)/2), 2))
In the event that doubts remain regarding the assumptions of the imputation procedure carried out in this section, Section 7.1 goes through an alternative strategy involving weights based on probabilities of missingness.
Now that our variables are free of missing values, we will create a series of visualizations to describe how they vary across station rings as well as over the time window under study.
We will work with the following variables (at the monthly level):
- (Average) temperature (ignore maximum and minimum until the next section)
- (Total) rain
- (Average) wind speed
- (Average) direction of maximum wind speed
- (Average) pressure
We start by exploring the relationship between our five weather variables and station rings. Recall that Ring 1 is the closest to, and Ring 6 the farthest from, Plaza del Carmen (the city centre). We merge the weather data with our dataset containing information about the air quality monitoring stations and their closest weather station, joining on the latter variable. We do this because we are interested in visualizing the weather variables as a function of the ring of the pollution stations each weather station is closest to.
Importantly, we replace missing values for the ‘ring’ column in the newly created dataframe by “Ring 6: Excluded”. The reason is that all missing values corresponded to the weather station in Somosierra since it was not selected as the closest weather station for any of our pollution stations in the computation performed earlier. As such, weather observations taken in Somosierra had no information about the ring and we thus had to assign the last ring to them manually.
We then use boxplots to compare the distributions of our variables in a given year (2019 for instance, as it is the first year of full implementation of Madrid Central). We select only average temperature to avoid redundancies, pivot our data, recode the newly created weather variable to select its levels manually and display the plot.
# Merge weather data with pollution station data by weather pollution station
weather_all <- weather_full_final_a %>% left_join(data_all_stations,
by = c("weather_station_name" = "closest_weather_station")) %>%
select(-c(town_code, station_code, station_type, address, longitude, latitude, distance_km))
# Replace all NA values in 'ring' with the level "Ring 6: Excluded"
weather_all$ring <- fct_explicit_na(weather_all$ring, na_level = "Ring 6: Excluded")
# Create the plot
weather_all %>%
# Remove minimum and maximum temperature variables for simplicity
select(-c(min_temperature_monthly, max_temperature_monthly)) %>%
# Pivot weather data to long format
pivot_longer(cols = c("avg_temperature_monthly", "total_rain_monthly", "wind_direction_monthly",
"avg_wind_speed_monthly", "avg_pressure_monthly"),
names_to = "weather_variable", values_to = "values") %>%
# Transform variable with weather variables as categories into factor and recode
mutate(weather_variable = factor(weather_variable),
weather_variable = case_when(
weather_variable == "avg_temperature_monthly" ~ "Average Temperature",
weather_variable == "total_rain_monthly" ~ "Total Rain",
weather_variable == "wind_direction_monthly" ~ "Average Wind Direction",
weather_variable == "avg_wind_speed_monthly" ~ "Average Wind Speed",
weather_variable == "avg_pressure_monthly" ~ "Average Pressure")) %>%
filter(year == 2019) %>%
# Plot the data with boxplots grouped by ring
ggplot(aes(x = ring, y = values, fill = ring)) +
geom_boxplot() +
facet_wrap(weather_variable ~ ., scales = "free_y") +
labs(x = "Ring",
y = "Measurement",
title= "Distribution of Weather Variables by Station Ring (2019)",
fill = "Station Ring") +
scale_fill_manual(values = ring_colors) +
theme_light() +
theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 8),
plot.title = element_text(size = 12),
legend.position = "none")
We see that for total rain and average temperature there are no significant differences across the different stations grouped into rings as the medians are quite similar. In contrast, wind-related variables and average pressure seem to be subject to larger differences. Stations in Ring 3 appear to have higher average wind direction (and wind speed but with smaller differences compared to Rings 4 and 5). For pressure, the medians are not that different but the distributions (especially the minimum values) appear to be.
We saw some differences when pooling all monthly observations for 2019. However, we want to assess whether there are any interesting patterns in the evolution of our weather variables across station rings over the course of that same year.
We perform similar data transformation operations as before, with the difference that now we are actually plotting the mean value of each weather variable for each station ring in each month of 2019. We use lines color-coded by station ring, with each weather variable plotted separately in its own facet.
# Create the plot
weather_all %>%
  # Remove minimum and maximum temperature variables for simplicity
  select(-c(min_temperature_monthly, max_temperature_monthly)) %>%
  # Pivot the data to long format
  pivot_longer(cols = c("avg_temperature_monthly", "total_rain_monthly", "wind_direction_monthly",
                        "avg_wind_speed_monthly", "avg_pressure_monthly"),
               names_to = "weather_variable", values_to = "values") %>%
  # Transform and recode the weather variable
  mutate(weather_variable = as.factor(weather_variable),
         weather_variable = case_when(
           weather_variable == "avg_temperature_monthly" ~ "Average Temperature",
           weather_variable == "total_rain_monthly" ~ "Total Rain",
           weather_variable == "wind_direction_monthly" ~ "Average Wind Direction",
           weather_variable == "avg_wind_speed_monthly" ~ "Average Wind Speed",
           weather_variable == "avg_pressure_monthly" ~ "Average Pressure")) %>%
  # Filter by year 2019
  filter(year == 2019) %>%
  # Group by ring and month and calculate the mean value of each weather variable
  group_by(ring, month, weather_variable) %>%
  summarise(mean_value = mean(values)) %>%
  ungroup() %>%
  # Plot the data with lines color-coded by ring
  ggplot(aes(x = month, y = mean_value, color = ring)) +
  geom_line() +
  facet_wrap(weather_variable ~ ., scales = "free_y") +
  labs(x = "Month",
       y = "Measurement",
       title = "Evolution of Monthly Averages of Weather Variables by Station Ring (2019)",
       color = "Ring") +
  scale_color_manual(values = ring_colors) +
  scale_x_continuous(breaks = c(2, 5, 8, 11),
                     labels = c("Feb", "May", "Aug", "Nov")) +
  theme_light() +
  theme(axis.text.x = element_text(hjust = 1, size = 8),
        plot.title = element_text(size = 12))
For some variables (e.g. temperature and, to a lesser extent, pressure and wind speed), the trends seem to be more or less parallel across station rings. In these cases, adding month and station fixed effects (see Section 5) could potentially be enough since temperature and pressure seem to be varying consistently through time across rings. However, meteorological conditions still change over time for all rings and the other variables appear to be subject to even stronger time-varying changes across rings. Thus, the inclusion of these variables into our models is essential to control for heterogeneity which could hamper our estimates.
With all the necessary data loaded and (almost) properly transformed, we are ready to kick off the modelling stage. We will employ a Difference-in-Differences (DiD) design (Snow 1855) to uncover whether there were any spatial pollution spillovers as a result of the introduction of restrictions in Madrid Central and, if so, how "big" they are and how far out from the LEZ they remain significant.
We start by checking the main assumption behind DiD: Parallel Trends. We plot the evolution of yearly averages (for simplicity) for each of our six rings in order to test whether they were following similar (i.e. parallel) trends prior to the policy implementation.
In order to create that plot we must first carry out some data manipulation. First we merge the dataset containing pollution data for stations in the Community of Madrid ('full_data_complete') and the one with data on pollution stations and the matched weather stations ('data_all_stations'). While these two dataframes have many variables in common, only the former contains pollution data and only the latter contains information about rings, hence the need to merge them.
We then filter out data from July 2021 onwards because of the fundamental changes to the policy taking place at that time (see main document). We select the necessary variables for plotting, after which we calculate the average yearly pollution level for each year and ring.
# Merge data with monthly pollution averages and station data
model_pollution_df <- full_data_complete %>%
  left_join(data_all_stations, by = c("town_code", "station_code", "address", "station_type",
                                      "longitude", "latitude", "distance_km")) %>%
  ungroup()

# Remove data from July 2021 onwards
model_pollution_df <- model_pollution_df %>%
  filter(year < 2021 | (year == 2021 & month < 7))

# Create a subset of the data with the necessary columns for plotting
model_pollution_subset <- model_pollution_df %>%
  select(year, month, pollution_level, ring)

# Calculate the yearly NO2 pollution averages for each ring
yearly_averages <- model_pollution_subset %>%
  group_by(year, ring) %>%
  summarise(avg_yearly_pollution = mean(pollution_level))
We can now plot the evolution of average pollution levels. If the assumption holds, the trends in the yearly averages should not diverge significantly before the introduction of the treatment.
We create a line plot of the yearly averages for each ring. We add a dashed vertical line marking November 30th, 2018 (the day Madrid Central came into force) using the geom_vline() function with xintercept = 2018.91, since November 30th is the 334th day of the year, which corresponds to roughly 0.91 as a fraction of the full year.
# Plot the yearly averages for each ring
ggplot(yearly_averages, aes(x = year, y = avg_yearly_pollution, color = ring)) +
  geom_line() +
  labs(x = "Year",
       y = expression("Yearly NO"[2] ~ "Pollution Averages"),
       title = expression("Examining Parallel Trends: Evolution of Yearly Average NO"[2] ~ "Levels by Ring (2015 - 2021)"),
       color = "Ring") +
  scale_color_manual(values = ring_colors) +
  geom_vline(xintercept = 2018.91,
             linetype = "dashed",
             color = "black") +
  annotate("text", x = 2019.05, y = 47,
           label = "Policy Implementation", vjust = -0.5, hjust = 0, size = 3) +
  theme_light() +
  theme(plot.title = element_text(size = 11))
There are two main takeaways from this plot.
First, we seem to have parallel trends. Despite some differences in the slopes between the first four rings and the last two, pre-intervention trends look approximately parallel, so we can plausibly attribute post-implementation differences in pollution levels to the policy. Put differently, we assume that, in the absence of the intervention, trends would have continued as already observed (i.e. similar slopes but different intercepts reflecting station-specific fixed effects).
Second, we can see that Ring 6 stations are actually quite different from those in the other five rings, as reflected in their flatter slope (in red).
Note that in Section 2.4 we removed rural stations from our data (6 out of the 8 stations in the last ring) in order to reflect more accurately the differences between urban and suburban stations in our network. Now we remove the remaining 2 stations from Ring 6 (Aranjuez and Villarejo de Salvanés) which happen to be background suburban and traffic suburban stations, respectively. We do this because, when running our models, we want control and treated stations to be as similar as possible and including Ring 6 stations in the former would threaten that goal. We then remove any levels of the ‘ring’ factor (the one referring to Ring 6 in this case) that are not present in the remaining data.
# Show stations (Ring 6) to remove and their type
model_pollution_df %>% filter(ring == "Ring 6: Excluded") %>% distinct(town_name, station_type)
## # A tibble: 2 × 2
## town_name station_type
## <chr> <chr>
## 1 Aranjuez Suburbana fondo
## 2 Villarejo de Salvanes Suburbana trafico
# Remove remaining stations in "Ring 6: Excluded"
model_pollution_df <- model_pollution_df %>%
filter(ring != "Ring 6: Excluded")
# Remove any levels not present in the dataframe
model_pollution_df$ring <- droplevels(model_pollution_df$ring)
To visualize our final set of stations after removing the ones in Ring 6, we plot below the 40 pollution monitoring stations (color-coded by ring) and 13 weather stations (represented by red crosses) used in our models:
# Filter out excluded stations from the 'data_all_stations' dataset
data_all_stations <- data_all_stations %>%
  filter(ring != "Ring 6: Excluded")

# Create plot
ggmap(madrid_map, extent = "normal", maprange = FALSE, alpha = 0.3) +
  geom_point(data = data_all_stations,
             aes(x = longitude, y = latitude, color = ring),
             size = 1.5) +
  geom_point(data = weather_stations,
             aes(x = longitude, y = latitude),
             color = "red",
             shape = 4,
             size = 3) +
  labs(title = "Location of Weather and Pollution Stations in the Community of Madrid",
       color = "") +
  scale_color_manual(values = c("black",
                                setNames(ring_colors, levels(data_all_stations$ring))),
                     labels = c("Ring 1", "Ring 2", "Ring 3", "Ring 4", "Ring 5")) +
  theme_void() +
  theme(legend.position = "bottom",
        legend.background = element_rect(colour = "black", size = 0.3),
        legend.key.size = unit(0.5, "cm"),
        legend.text = element_text(size = 7),
        legend.title = element_text(size = 8),
        legend.box.spacing = unit(0.1, "cm"),
        legend.box.margin = margin(0, 0, 0, 0),
        legend.box.background = element_rect(color = "black", linewidth = 0.2),
        plot.title = element_text(size = 12, hjust = 0.5))
Now we perform our last merge of dataframes in order to gather all the attributes needed for running the models into 'did_df'. The dataframe 'model_pollution_df' contains information about dates, pollution and stations (location, type, distance from the centre, ring and closest weather station), while 'weather_full_final_a' has data on dates, weather stations and weather variables. We can hence join them by weather station name and date. The weather station name differs between the two dataframes, so we rename it in the first before merging it with the second:
# Rename weather station variable so that it is the same in both datasets
model_pollution_df <- model_pollution_df %>%
  rename(weather_station_name = closest_weather_station)

# Merge pollution and weather data
did_df <- model_pollution_df %>%
  left_join(weather_full_final_a,
            by = c("year", "month", "weather_station_name"))
With our final dataframe ready, we can now create our main variables for the DiD models. We first need a dummy taking the value of 1 for observations happening after the introduction of Madrid Central measures (i.e. after and including December 2018), and a value of 0 otherwise.
# Create post-treatment dummy (POST = 1 if year is greater than 2018, or if equal to 2018 and the month is December)
did_df <- did_df %>%
  mutate(POST = ifelse(year > 2018 | (year == 2018 & month == 12), 1, 0))
Next, we create a series of treatment dummies for each ring to include in our model. We mutate four new columns in the dataframe that take on a value of 1 if the corresponding observation is in the treated group for that ring, and 0 otherwise.
# Create treatment dummies for each ring
did_df <- did_df %>%
  mutate(treatment_ring1 = ifelse(ring == "Ring 1: Treated", 1, 0),
         treatment_ring2 = ifelse(ring == "Ring 2: Treated", 1, 0),
         treatment_ring3 = ifelse(ring == "Ring 3: Treated", 1, 0),
         treatment_ring4 = ifelse(ring == "Ring 4: Treated", 1, 0))
The logical next step within the DiD framework is producing new variables that identify whether a given observation is in the post-treatment period for each treatment group. This is an interaction term that “activates” the treatment for each group only in the period after the LEZ’s entry into force.
# Create interaction terms by multiplying POST by each corresponding treatment ring
did_df <- did_df %>%
  mutate(POST_treat_1 = POST * as.numeric(treatment_ring1),
         POST_treat_2 = POST * as.numeric(treatment_ring2),
         POST_treat_3 = POST * as.numeric(treatment_ring3),
         POST_treat_4 = POST * as.numeric(treatment_ring4))
The resulting variables have a value of 0 for observations in the control group or in the pre-treatment period, and will have the same value as the ‘POST’ variable for observations in the treated group (depending on the ring) in the post-treatment period.
Since our units of analysis are pollution monitoring stations, we would like to create a variable (‘station_id’) that uniquely identifies each station in the Community of Madrid air quality network. We simply paste the station’s unique code with the code of the town it belongs to.
# Create 'station_id' by concatenating 'town_code' and 'station_code', separated by a hyphen
did_df$station_id <- as.factor(paste(did_df$town_code, did_df$station_code, sep="-"))
# Print number of different station codes
length(unique(did_df$station_id))
## [1] 40
We see that, after removing stations with different characteristics than the ones included as controls (i.e. Ring 6), we are left with 40 different air quality monitoring stations. We will use these codes for producing clustered standard errors and creating station fixed effects.
We are done with the feature engineering of the treatment-related variables. As for the controls, we mentioned earlier that we would choose just one of the three variables containing information about temperature to avoid multicollinearity.
To be sure we are including the most relevant one to explain pollution, we compute the correlation of each of the three variables with our outcome.
# Calculate correlation coefficients
correlations <- cor(did_df[c("pollution_level",
                             "avg_temperature_monthly",
                             "min_temperature_monthly",
                             "max_temperature_monthly")])

# Extract correlation coefficients for outcome
pollution_correlations <- correlations[1, 2:4]

# Print absolute values of correlation coefficients
print(round(abs(pollution_correlations), 2))
## avg_temperature_monthly min_temperature_monthly max_temperature_monthly
## 0.41 0.32 0.38
Average temperature has the highest correlation with pollution, over 0.4 in absolute value. Therefore, we remove the other two variables. We also remove variables containing redundant information from our dataset.
# Remove 'min_temperature_monthly' and 'max_temperature_monthly' from the dataframe
did_df <- did_df %>% select(-c(min_temperature_monthly, max_temperature_monthly))
# Remove redundant variables
did_df <- did_df %>% select(-c(town_code, station_code, type))
Since we now have the final list of our five weather controls, we can check their degree of association with the outcome variable, as measured by Pearson's correlation coefficient, just as we did with average temperature. Hence, we select the six variables (outcome and five weather controls), compute the correlations and make some adjustments to our correlation plot:
# Select columns containing the variables of interest
cols_of_interest <- c("pollution_level", "avg_temperature_monthly",
                      "total_rain_monthly", "wind_direction_monthly",
                      "avg_wind_speed_monthly", "avg_pressure_monthly")

# Compute correlations
cc <- cor(did_df[, cols_of_interest])

# Increase the top plot margin
par(mar = c(5, 4, 1, 2))

# Customize and draw the correlation plot
corrplot(cc, method = "color", type = "upper", order = "hclust", tl.col = "black",
         tl.srt = 20, tl.cex = 0.8, tl.offset = 0.4, diag = FALSE, addrect = 4,
         addCoef.col = "black",
         col = colorRampPalette(c("#FFFFFF", "#7EBCEB", "#08306B"))(200),
         mar = c(0, 0, 1, 0))

# Add title to the plot
mtext("Correlation Plot of Weather Controls and Pollution Level", side = 3, line = -0.1, cex = 1.1)
It is clear that correlations among our weather variables are not excessively high, the largest one being the (obviously negative) association between average temperature and total rain measured on a monthly basis.
We also see that, among all weather variables, average temperature and average wind speed seem to be the most highly correlated with pollution. Both have moderate negative linear associations with the outcome variable. Pressure, rain and wind direction are barely associated (linearly) with pollution since their correlation coefficients are close to 0. Still, even if some variables may not be highly correlated with the dependent variable in our models, we will still include them because they may have explanatory power in conjunction with other variables. Furthermore, these covariates may have non-linear relationships with pollution which would go uncaptured by the correlation coefficient (see Section 6.1.I). More importantly, we should be concerned that some of these variables (without concrete knowledge of which) may be responsible for some variability in the dependent variable that must be accounted for to avoid biased estimates in our model.
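A minimal illustration of this last caveat, with made-up numbers rather than our data: a perfect quadratic relationship between two variables can produce a Pearson correlation of essentially zero, so a near-zero coefficient does not rule out a strong non-linear association.

```r
# Symmetric predictor and a deterministic quadratic response (toy data)
x <- seq(-1, 1, by = 0.01)
y <- x^2

# Pearson's r is (numerically) zero despite the exact functional dependence
round(cor(x, y), 10)
```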
We create a table summarising the variables in our model, including their names in this code, description and basic summary statistics (mean and standard deviation).
# Define variable names
var_names <- c("pollution_level", "POST",
               "treatment_ring1", "treatment_ring2", "treatment_ring3", "treatment_ring4",
               "avg_temperature_monthly", "total_rain_monthly", "wind_direction_monthly",
               "avg_wind_speed_monthly", "avg_pressure_monthly")

# Define variable descriptions
var_descriptions <- c(
  "Average monthly nitrogen dioxide (NO2) in micrograms per cubic meter (µg/m3)",
  "Dummy: 1 if observation is after the date of implementation of the LEZ (November 2018) and 0 otherwise",
  "Dummy: 1 if pollution monitoring station is in Plaza del Carmen, Plaza de España, Escuelas Aguirre or Parque del Retiro, and 0 otherwise",
  "Dummy: 1 if pollution monitoring station is in Castellana, Méndez Alvaro, Cuatro Caminos, Farolillo, Casa de Campo, Plaza Elíptica, Ramon y Cajal, Moratalaz, Plaza Castilla, Vallecas, Arturo Soria or Barrio del Pilar, and 0 otherwise",
  "Dummy: 1 if pollution monitoring station is in Villaverde, Juan Carlos I, Sanchinarro, Tres Olivos, Ensanche de Vallecas, Leganés, Urb. Embajada, Getafe, Barajas Pueblo, El Pardo, Coslada, Alcorcón, Alcobendas or Majadahonda, and 0 otherwise",
  "Dummy: 1 if pollution monitoring station is in Rivas-Vaciamadrid, Fuenlabrada, Móstoles or Torrejón de Ardoz, and 0 otherwise",
  "Monthly average of daily average temperatures in Celsius (C)",
  "Monthly total precipitation in millimeters (mm)",
  "Monthly average direction of daily maximum wind speed in tens of degrees",
  "Monthly average of daily average wind speed in meters per second (m/s)",
  "Monthly average of daily minimum and maximum pressure in hectopascals (hPa)")

# Create dataframe with variable names and descriptions
variable_descriptions_df <- data.frame(variable_name = var_names,
                                       variable_description = var_descriptions)

# Convert 'variable_name' to a factor variable with the desired order of levels
variable_descriptions_df$variable_name <- factor(variable_descriptions_df$variable_name,
                                                 levels = var_names,
                                                 ordered = TRUE)

# Calculate summary statistics for the chosen variables
summary_df <- data.frame(
  variable_name = var_names,
  mean = round(sapply(did_df[, var_names], mean), 2),
  SD = round(sapply(did_df[, var_names], sd), 2))

# Remove the row names from the summary dataframe
rownames(summary_df) <- NULL

# Merge summary statistics with variable descriptions
result_df <- variable_descriptions_df %>% left_join(summary_df, by = "variable_name")

# Rename columns in the final dataframe
names(result_df) <- c("Variable Name", "Variable Description", "Mean", "Standard Deviation")

# Order the rows of 'result_df' by the order of levels in 'variable_name'
result_df <- result_df[order(variable_descriptions_df$variable_name),]

# Format table
kable(result_df, format = "html", row.names = FALSE) %>%
  kable_styling(c("striped", "hover", "bordered"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
| Variable Name | Variable Description | Mean | Standard Deviation |
|---|---|---|---|
| pollution_level | Average monthly nitrogen dioxide (NO2) in micrograms per cubic meter (µg/m3) | 32.73 | 14.47 |
| POST | Dummy: 1 if observation is after the date of implementation of the LEZ (November 2018) and 0 otherwise | 0.40 | 0.49 |
| treatment_ring1 | Dummy: 1 if pollution monitoring station is in Plaza del Carmen, Plaza de España, Escuelas Aguirre or Parque del Retiro, and 0 otherwise | 0.10 | 0.30 |
| treatment_ring2 | Dummy: 1 if pollution monitoring station is in Castellana, Méndez Alvaro, Cuatro Caminos, Farolillo, Casa de Campo, Plaza Elíptica, Ramon y Cajal, Moratalaz, Plaza Castilla, Vallecas, Arturo Soria or Barrio del Pilar, and 0 otherwise | 0.30 | 0.46 |
| treatment_ring3 | Dummy: 1 if pollution monitoring station is in Villaverde, Juan Carlos I, Sanchinarro, Tres Olivos, Ensanche de Vallecas, Leganés, Urb. Embajada, Getafe, Barajas Pueblo, El Pardo, Coslada, Alcorcón, Alcobendas or Majadahonda, and 0 otherwise | 0.35 | 0.48 |
| treatment_ring4 | Dummy: 1 if pollution monitoring station is in Rivas-Vaciamadrid, Fuenlabrada, Móstoles or Torrejón de Ardoz, and 0 otherwise | 0.10 | 0.30 |
| avg_temperature_monthly | Monthly average of daily average temperatures in Celsius (C) | 15.27 | 7.19 |
| total_rain_monthly | Monthly total precipitation in millimeters (mm) | 34.18 | 31.47 |
| wind_direction_monthly | Monthly average direction of daily maximum wind speed in tens of degrees | 24.73 | 8.41 |
| avg_wind_speed_monthly | Monthly average of daily average wind speed in meters per second (m/s) | 2.64 | 0.88 |
| avg_pressure_monthly | Monthly average of daily minimum and maximum pressure in hectopascals (hPa) | 934.87 | 19.87 |
While the causal effect of interest is usually extracted by means of regression, we can first exemplify the logic behind DiD by performing the simple calculation below.
We rely on the potential outcome framework. For the sake of simplicity, we assume that only stations inside and immediately outside the LEZ (Plaza del Carmen, Plaza de España, Escuelas Aguirre and Parque del Retiro) are treated (“treatment_ring1==1”) and the rest are not (“treatment_ring1==0”). Recall we also have the time variable indicating whether a given observation was taken before or after the introduction of the policy (“POST==0” and “POST==1”, respectively).
By subtracting the pre-intervention difference in mean pollution levels between the treatment group (our four stations) and the control group (the remaining stations) from the post-intervention difference between the same two groups, we can approximate the average treatment effect on the treated (ATT). In our case, this is the change in pollution levels in the treatment group attributable to the policy, after accounting for any pre-existing differences in pollution levels between the groups.
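In symbols: ATT = (mean treated, post − mean control, post) − (mean treated, pre − mean control, pre). A tiny helper function makes the arithmetic explicit; `did_2x2` and the numbers below are purely illustrative, not drawn from our dataset.

```r
# Double difference from the four group-period means (toy helper)
did_2x2 <- function(pre_treat, post_treat, pre_control, post_control) {
  (post_treat - post_control) - (pre_treat - pre_control)
}

# Hypothetical means: treated falls from 40 to 30, control from 35 to 32
did_2x2(pre_treat = 40, post_treat = 30, pre_control = 35, post_control = 32)
# A drop of 10 in the treated group vs 3 in controls gives an ATT of -7
```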
We display the means for each of the four cases in the table below:
# Mean pollution level for post-treatment and treatment group 1
posttreat <- mean(did_df$pollution_level[did_df$POST==1 & did_df$treatment_ring1==1])
# Mean pollution level for pre-treatment and treatment group 1
pretreat <- mean(did_df$pollution_level[did_df$POST==0 & did_df$treatment_ring1==1])
# Mean pollution level for post-treatment and "control" group (all other stations)
postcontrol <- mean(did_df$pollution_level[did_df$POST==1 & did_df$treatment_ring1==0])
# Mean pollution level for pre-treatment and control group
precontrol <- mean(did_df$pollution_level[did_df$POST==0 & did_df$treatment_ring1==0])
# Calculate the difference in means (DID) between treatment and control groups
did_att <- (posttreat-postcontrol)-(pretreat-precontrol)
# Create a 3x3 table
table_ATT <- matrix(0, nrow = 3, ncol = 3)
# Fill in the table with the differences
table_ATT[1, 1] <- round(pretreat, 2)
table_ATT[1, 2] <- round(posttreat, 2)
table_ATT[1, 3] <- round(posttreat - pretreat, 2)
table_ATT[2, 1] <- round(precontrol, 2)
table_ATT[2, 2] <- round(postcontrol, 2)
table_ATT[2, 3] <- round(postcontrol - precontrol, 2)
table_ATT[3, 1] <- round(pretreat - precontrol, 2)
table_ATT[3, 2] <- round(posttreat - postcontrol, 2)
table_ATT[3, 3] <- round((posttreat - postcontrol) - (pretreat - precontrol), 2)
# Add row and column names
row.names(table_ATT) <- c("TREATMENT", "CONTROL", "DIFFERENCE")
colnames(table_ATT) <- c("PRE", "POST", "DIFFERENCE")
# Print the table
kable(table_ATT, format = "html") %>%
kable_styling(c("striped", "bordered"), full_width = FALSE)
|  | PRE | POST | DIFFERENCE |
|---|---|---|---|
| TREATMENT | 45.80 | 33.04 | -12.76 |
| CONTROL | 34.45 | 27.88 | -6.57 |
| DIFFERENCE | 11.35 | 5.16 | -6.19 |
We obtain an ATT of -6.19 (third row, third column - the result of taking the “double difference”), which reflects the average reduction in pollution as measured by NO2 concentrations for the stations in Plaza del Carmen, Plaza de España, Escuelas Aguirre and Parque del Retiro as a result of the policy.
It is time to run our DiD regressions. In practice, DiD identification strategies are implemented through fixed-effects models. We run five different ones using the plm() function from the plm package, each with an increasing number of predictors and thus a greater degree of complexity than the previous one.
First of all, we convert the month column to a string so that we are later able to remove the 11 levels of that variable from the regression table summarising our five models, for the sake of simplicity. We also transform the time, treatment and interaction terms into factors.
# Convert month variable to a string
did_df$month <- as.character(did_df$month)

# Transform the time, treatment and interaction variables into factors
did_df <- did_df %>%
  mutate_if(grepl("treatment_ring|POST_treat|POST", names(.)), ~ as.factor(.))
The models in this section rely on within-group transformations (i.e. ‘model’ argument set to “within”), which allows us to account for individual-specific unobserved heterogeneity (i.e. unit fixed effects) that may affect the dependent variable (‘pollution_level’) by subtracting the average of each station’s observations from the observations of that given station. This transformation effectively removes time-invariant differences across stations, leaving only the within-individual variation to be explained by the regressors and allowing different intercepts for different stations.
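To make the within transformation concrete, here is a toy check — a fabricated two-station panel, not our data — showing that demeaning each variable by station and running plain OLS recovers the common slope regardless of station-specific intercepts:

```r
library(dplyr)  # already loaded via tidyverse in our setup

# Toy two-station panel where y = 2x + a station-specific intercept (9 for A, 20 for B)
toy <- data.frame(
  station = rep(c("A", "B"), each = 4),
  x = c(1, 2, 3, 4, 2, 4, 6, 8),
  y = c(11, 13, 15, 17, 24, 28, 32, 36)
)

# Demean x and y within each station...
toy_demeaned <- toy %>%
  group_by(station) %>%
  mutate(x_dm = x - mean(x), y_dm = y - mean(y)) %>%
  ungroup()

# ...and OLS on the demeaned data recovers the common slope of 2,
# with the station intercepts wiped out by the transformation
coef(lm(y_dm ~ x_dm - 1, data = toy_demeaned))
```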
We start with the simplest possible model: we regress the pollution level on the time, treatment and interaction dummies assuming only Plaza del Carmen and the three stations closest to it are treated. We include time (in this case, month) fixed effects since our data is at the monthly level and these are thus likely to capture more variability than year fixed effects (see Section 6.1.H). The model automatically incorporates station fixed effects when we set the ‘index’ argument equal to “station_id”.
# Set up the fixed effects model
m1 <- plm(pollution_level ~ POST + treatment_ring1 + POST_treat_1 + month,
          model = "within", index = "station_id",
          data = did_df)

summary(m1)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = pollution_level ~ POST + treatment_ring1 + POST_treat_1 +
## month, data = did_df, model = "within", index = "station_id")
##
## Balanced Panel: n = 40, T = 78, N = 3120
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -24.987526 -3.985351 -0.014801 3.700320 33.151361
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## POST1 -6.97690 0.26836 -25.9987 < 2.2e-16 ***
## POST_treat_11 -6.19026 0.84395 -7.3348 2.828e-13 ***
## month10 -8.79217 0.60928 -14.4305 < 2.2e-16 ***
## month11 -5.70158 0.60928 -9.3580 < 2.2e-16 ***
## month12 -0.53002 0.60907 -0.8702 0.3842
## month2 -8.60203 0.58491 -14.7066 < 2.2e-16 ***
## month3 -16.30194 0.58491 -27.8709 < 2.2e-16 ***
## month4 -23.15104 0.58491 -39.5805 < 2.2e-16 ***
## month5 -24.76630 0.58491 -42.3421 < 2.2e-16 ***
## month6 -24.39553 0.58491 -41.7082 < 2.2e-16 ***
## month7 -22.69518 0.60928 -37.2494 < 2.2e-16 ***
## month8 -23.70388 0.60928 -38.9050 < 2.2e-16 ***
## month9 -14.89930 0.60928 -24.4541 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 445050
## Residual Sum of Squares: 146900
## R-Squared: 0.66992
## Adj. R-Squared: 0.66433
## F-statistic: 478.829 on 13 and 3067 DF, p-value: < 2.22e-16
All our variables seem to be significant (except for one level in the ‘month’ variable) and the R-squared is 0.67, meaning 67% of the variability in the pollution level is explained by the variables in our model.
However, we might want to introduce clustered standard errors since the residuals may not be evenly distributed around the line of best fit and would thus not be "iid" (independent and identically distributed). This could be due to heterogeneity across stations (location, closeness to roads or centrality in the region). Hence, to confirm the need for clustering, we perform a Breusch-Pagan test for heteroskedasticity on our model.
# Breusch-Pagan test for heteroskedasticity
bp <- bptest(m1)
print(bp)
##
## studentized Breusch-Pagan test
##
## data: m1
## BP = 246.31, df = 14, p-value < 2.2e-16
Since the p-value is clearly below the significance level of 0.05, we reject the null hypothesis of homoskedastic residuals and conclude that there is evidence of heteroskedasticity in the model.
Therefore, we compute a t-test with clustered standard errors on our simplest model using the coeftest() function, specifying heteroskedasticity-consistent (HC) standard errors. After trying different values for 'type', we select HC1, which calculates the standard errors using the "sandwich estimator." This yields valid (consistent) standard errors and more accurate p-values in the presence of heteroskedasticity and/or autocorrelation, both of which violate the assumptions of ordinary least squares (OLS) regression because the error variance is not constant across stations.
Several options (HC0 through HC3) are available for the type of standard errors. For reference, in our case the standard errors increase as we move from HC1 to HC2, so the critical value decreases and the p-values rise; in other words, raising the "HC type" makes it increasingly difficult to find significance. Hence, we stay at HC1: conservative relative to HC0, but not so strict as to rule out significant effects. Still, the choice is not critical, as all "HC types" agree on which variables are significant.
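As an aside, the HC variants can be compared on any fitted model. The sketch below — a built-in R dataset unrelated to our analysis, using the sandwich package on which these estimators rely — prints the slope's standard error under each type. HC3 is the most conservative by construction, while HC1 and HC2 are not ordered in general, which is why the choice is best checked empirically:

```r
library(sandwich)  # provides vcovHC() for lm objects

# Toy OLS fit on a built-in dataset (illustration only)
fit <- lm(mpg ~ wt, data = mtcars)

# Standard error of the 'wt' slope under each HC type
sapply(c("HC0", "HC1", "HC2", "HC3"),
       function(t) sqrt(vcovHC(fit, type = t)["wt", "wt"]))
```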
# Compute t-test with clustered standard errors (HC1)
clustered_test_m1 <- coeftest(m1, vcov = vcovHC(m1, type = "HC1", cluster = "group"))
print(clustered_test_m1)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## POST1 -6.97690 0.43991 -15.8599 < 2.2e-16 ***
## POST_treat_11 -6.19026 1.53531 -4.0319 5.667e-05 ***
## month10 -8.79217 0.41961 -20.9532 < 2.2e-16 ***
## month11 -5.70158 0.27426 -20.7890 < 2.2e-16 ***
## month12 -0.53002 0.21533 -2.4615 0.01389 *
## month2 -8.60203 0.33125 -25.9683 < 2.2e-16 ***
## month3 -16.30194 0.47108 -34.6053 < 2.2e-16 ***
## month4 -23.15104 0.63093 -36.6934 < 2.2e-16 ***
## month5 -24.76630 0.66914 -37.0124 < 2.2e-16 ***
## month6 -24.39553 0.72300 -33.7423 < 2.2e-16 ***
## month7 -22.69518 0.71482 -31.7495 < 2.2e-16 ***
## month8 -23.70388 0.77106 -30.7420 < 2.2e-16 ***
## month9 -14.89930 0.59572 -25.0106 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Our treatment effect (the coefficient for 'POST_treat_1') for the four stations closest to the city centre (those within a range of around 2.5 kilometres from Plaza del Carmen) is around -6.2, matching our manual calculation of the ATT. This is a clear negative (causal) effect of the implementation of restrictions within Madrid Central on pollution levels around these four stations.
We also see that all levels of the ‘month’ variable are now significant and have negative coefficients, reflecting the insights introduced in Section 5.1 (Parallel Trends) where aggregate pollution levels showed a negative time trend. In absolute value, the magnitude of this decrease seems to be strongest for the months of May and June and weakest for December as seen in Section 2.4.
We can investigate further how the estimates for Ring 1 vary over time. Instead of months, we use years to avoid having an excessive number of time points. For this we use the feols() function from the fixest package to estimate OLS with our station and (now) year fixed effects (see Section 6.1.H for more on year fixed effects). We choose the year of implementation, 2018, as the reference year (not shown in the plot).
We perform a series of transformations and plot our estimates for the treatment effect of Ring 1 stations for each given year:
# Convert 'year' into a factor and relevel it with the reference level set to '2018'
did_df_feols <- did_df %>%
  mutate(year = relevel(factor(year), ref = '2018'))

# Run `feols()` model
FE_OLS <- feols(pollution_level ~ I(treatment_ring1 == 1)*year | station_id + year,
                cluster = "station_id",
                data = did_df_feols)

# Specify x-axis labels for the years
year_labels <- c("2015", "2016", "2017", "2019", "2020", "2021")

# Create a dataframe with coefficient estimates, standard errors and year labels
coef_df <- data.frame(coef = coef(FE_OLS), se = se(FE_OLS), year = year_labels)

# Convert to factor
coef_df <- coef_df %>% mutate(year = as.factor(year))

# Create the estimates plot (error bars span the approximate 95% confidence interval)
ggplot(coef_df, aes(x = year, y = coef)) +
  geom_errorbar(aes(ymin = coef - 1.96 * se, ymax = coef + 1.96 * se), width = 0.4) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "blue") +
  labs(x = "Year",
       y = "Estimate (95% Confidence Interval)",
       title = "Dynamic Treatment Effects for Ring 1 Stations by Year") +
  theme_light()
We see that before 2018, the confidence intervals for the treatment effect contained 0. However, from 2019 onwards, these are consistently in negative territory. Moreover, it seems that the (negative) impact of the policy on pollution was actually higher in 2020 compared to the first year of full implementation, thereby suggesting a strengthening of treatment effects in the four Ring 1 stations over time.
For Model 2 we still take our four most central stations as the only treated units. Yet, apart from controlling for month and station through fixed effects, we add our weather covariates. We also remove the treatment dummies from the regression model, since the within estimator used by plm() drops time-invariant variables: the treatment dummies are absorbed by the station fixed effects, so results are identical with and without them.
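As background on the estimator itself: the "within" model fitted by plm() is numerically equivalent to OLS after demeaning every variable by station, or equivalently to including one dummy per station. A minimal base-R sketch on simulated data (all names hypothetical) illustrates the equivalence:

```r
set.seed(42)
# Simulated panel: 5 stations observed over 20 periods, with
# station-specific intercepts
panel <- data.frame(id = rep(1:5, each = 20), x = rnorm(100))
panel$y <- 2 * panel$x + rep(rnorm(5, sd = 3), each = 20) + rnorm(100)

# (1) Least-squares dummy variables: one intercept per station
lsdv <- lm(y ~ x + factor(id), data = panel)

# (2) Within transformation: demean y and x by station, then run OLS
panel$y_dm <- panel$y - ave(panel$y, panel$id)
panel$x_dm <- panel$x - ave(panel$x, panel$id)
within_fit <- lm(y_dm ~ x_dm - 1, data = panel)

# Both approaches recover the identical slope on x
all.equal(unname(coef(lsdv)["x"]), unname(coef(within_fit)["x_dm"]))  # TRUE
```

This is also why any regressor that is constant within a station (such as a pure treatment dummy) drops out of the within model.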
# Set up the fixed effects model
m2 <- plm(pollution_level ~ POST + POST_treat_1 +
            avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
            wind_direction_monthly + avg_pressure_monthly +
            month,
          model = "within", index = "station_id",
          data = did_df)
summary(m2)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = pollution_level ~ POST + POST_treat_1 + avg_temperature_monthly +
## total_rain_monthly + avg_wind_speed_monthly + wind_direction_monthly +
## avg_pressure_monthly + month, data = did_df, model = "within",
## index = "station_id")
##
## Balanced Panel: n = 40, T = 78, N = 3120
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -22.519760 -3.759124 -0.078218 3.701868 33.160595
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## POST1 -5.9473473 0.2432304 -24.4515 < 2.2e-16 ***
## POST_treat_11 -4.8299489 0.7353164 -6.5685 5.948e-11 ***
## avg_temperature_monthly -0.1435042 0.0722170 -1.9871 0.0469974 *
## total_rain_monthly -0.0513408 0.0044325 -11.5827 < 2.2e-16 ***
## avg_wind_speed_monthly -4.9349604 0.2085901 -23.6586 < 2.2e-16 ***
## wind_direction_monthly -0.0750806 0.0227845 -3.2952 0.0009946 ***
## avg_pressure_monthly -0.0230897 0.0194141 -1.1893 0.2344021
## month10 -8.0155215 0.8938668 -8.9672 < 2.2e-16 ***
## month11 -4.9007961 0.6203175 -7.9005 3.841e-15 ***
## month12 -2.1510398 0.5438896 -3.9549 7.829e-05 ***
## month2 -6.8213455 0.5372431 -12.6969 < 2.2e-16 ***
## month3 -11.9980392 0.6287979 -19.0809 < 2.2e-16 ***
## month4 -18.4655908 0.7748336 -23.8317 < 2.2e-16 ***
## month5 -21.3357585 0.9996020 -21.3443 < 2.2e-16 ***
## month6 -21.0576208 1.2912756 -16.3076 < 2.2e-16 ***
## month7 -19.7359705 1.5895186 -12.4163 < 2.2e-16 ***
## month8 -21.4271853 1.4938446 -14.3437 < 2.2e-16 ***
## month9 -13.9027945 1.1942037 -11.6419 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 445050
## Residual Sum of Squares: 110570
## R-Squared: 0.75155
## Adj. R-Squared: 0.74692
## F-statistic: 514.578 on 18 and 3062 DF, p-value: < 2.22e-16
# T-test with clustered standard errors
clustered_test_m2 <- coeftest(m2, vcov = vcovHC(m2, type = "HC1", cluster = "group"))
print(clustered_test_m2)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## POST1 -5.9473473 0.4577064 -12.9938 < 2.2e-16 ***
## POST_treat_11 -4.8299489 1.5337223 -3.1492 0.001653 **
## avg_temperature_monthly -0.1435042 0.0829049 -1.7310 0.083561 .
## total_rain_monthly -0.0513408 0.0056766 -9.0443 < 2.2e-16 ***
## avg_wind_speed_monthly -4.9349604 0.3855256 -12.8006 < 2.2e-16 ***
## wind_direction_monthly -0.0750806 0.0246257 -3.0489 0.002317 **
## avg_pressure_monthly -0.0230897 0.0426763 -0.5410 0.588518
## month10 -8.0155215 0.9742748 -8.2272 2.802e-16 ***
## month11 -4.9007961 0.5286663 -9.2701 < 2.2e-16 ***
## month12 -2.1510398 0.2750942 -7.8193 7.251e-15 ***
## month2 -6.8213455 0.4013852 -16.9945 < 2.2e-16 ***
## month3 -11.9980392 0.6871073 -17.4617 < 2.2e-16 ***
## month4 -18.4655908 0.9147885 -20.1856 < 2.2e-16 ***
## month5 -21.3357585 1.2068775 -17.6785 < 2.2e-16 ***
## month6 -21.0576208 1.5158347 -13.8918 < 2.2e-16 ***
## month7 -19.7359705 1.8265641 -10.8050 < 2.2e-16 ***
## month8 -21.4271853 1.7942757 -11.9420 < 2.2e-16 ***
## month9 -13.9027945 1.3480227 -10.3135 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
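To build intuition for what vcovHC(..., cluster = "group") computes, the cluster-robust covariance can be assembled by hand: sum each cluster's score contributions, take the outer products of those sums (the "meat"), and sandwich them between (X'X)^-1 (the "bread"). A minimal base-R sketch on simulated data — the helper name and the exact finite-sample correction are illustrative; plm's HC1 implementation differs in details:

```r
set.seed(1)
# Simulated data: 8 clusters of 25 observations with cluster-correlated errors
n_g <- 8; n_t <- 25
g <- rep(1:n_g, each = n_t)
x <- rnorm(n_g * n_t)
y <- 1 + 0.5 * x + rep(rnorm(n_g), each = n_t) + rnorm(n_g * n_t)
fit <- lm(y ~ x)

# Hypothetical helper: CR1-style cluster-robust covariance
cluster_vcov <- function(fit, cluster) {
  X <- model.matrix(fit)
  u <- residuals(fit)
  G <- length(unique(cluster)); N <- nrow(X); K <- ncol(X)
  scores <- rowsum(X * u, cluster)             # G x K within-cluster score sums
  meat <- crossprod(scores)                    # sum of outer products
  bread <- solve(crossprod(X))                 # (X'X)^-1
  corr <- (G / (G - 1)) * ((N - 1) / (N - K))  # finite-sample correction
  corr * bread %*% meat %*% bread
}

V <- cluster_vcov(fit, g)
sqrt(diag(V))  # clustered standard errors for the intercept and x
```

Because errors are allowed to correlate freely within a station, clustered standard errors are typically larger than the conventional ones, which is exactly the pattern visible when comparing summary(m2) with clustered_test_m2.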
All controls are statistically significant at the 0.05 significance level except average pressure and average temperature. Our adjusted R-squared (which penalizes the inclusion of irrelevant variables into the model) has risen to around 74.7%, meaning that meteorological information is indeed relevant in explaining the variability in pollution levels.
Our treatment effect for the four stations closest to the city centre is now around -4.8 with a p-value of 0.002. That is, after controlling for meteorological conditions, the policy has led to a significant decrease of 4.8 µg/m3 in NO2 concentrations inside and in the immediate neighbourhood of Madrid Central. This effect is smaller in absolute value than that found in Model 1: by incorporating weather controls, we were able to “strip away” the part of the pollution decline that was driven not by the restrictions in Madrid Central but by fluctuating meteorological conditions.
Model 3 retains the same fixed effects and controls, but we now incorporate the second treatment ring (stations between 2.5 and 10 kilometres from Plaza del Carmen) in addition to the first.
# Set up the fixed effects model
m3 <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 +
            avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
            wind_direction_monthly + avg_pressure_monthly +
            month,
          model = "within", index = "station_id",
          data = did_df)
summary(m3)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = pollution_level ~ POST + POST_treat_1 + POST_treat_2 +
## avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
## wind_direction_monthly + avg_pressure_monthly + month, data = did_df,
## model = "within", index = "station_id")
##
## Balanced Panel: n = 40, T = 78, N = 3120
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -22.506609 -3.815558 -0.084024 3.734732 32.810488
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## POST1 -5.5476293 0.2942513 -18.8534 < 2.2e-16 ***
## POST_treat_11 -5.2336318 0.7535959 -6.9449 4.607e-12 ***
## POST_treat_21 -1.1955516 0.4961391 -2.4097 0.0160238 *
## avg_temperature_monthly -0.1553601 0.0723279 -2.1480 0.0317922 *
## total_rain_monthly -0.0512628 0.0044292 -11.5739 < 2.2e-16 ***
## avg_wind_speed_monthly -4.9003670 0.2089204 -23.4557 < 2.2e-16 ***
## wind_direction_monthly -0.0802765 0.0228685 -3.5103 0.0004540 ***
## avg_pressure_monthly -0.0228754 0.0193990 -1.1792 0.2384096
## month10 -7.8904847 0.8946720 -8.8194 < 2.2e-16 ***
## month11 -4.8495608 0.6201957 -7.8194 7.245e-15 ***
## month12 -2.1150944 0.5436679 -3.8904 0.0001022 ***
## month2 -6.8103280 0.5368414 -12.6859 < 2.2e-16 ***
## month3 -11.9726520 0.6283933 -19.0528 < 2.2e-16 ***
## month4 -18.4029092 0.7746630 -23.7560 < 2.2e-16 ***
## month5 -21.2153214 1.0000680 -21.2139 < 2.2e-16 ***
## month6 -20.8785400 1.2924017 -16.1548 < 2.2e-16 ***
## month7 -19.5080971 1.5910851 -12.2609 < 2.2e-16 ***
## month8 -21.2159465 1.4952454 -14.1889 < 2.2e-16 ***
## month9 -13.7264403 1.1955096 -11.4817 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 445050
## Residual Sum of Squares: 110360
## R-Squared: 0.75202
## Adj. R-Squared: 0.74732
## F-statistic: 488.566 on 19 and 3061 DF, p-value: < 2.22e-16
# T-test with clustered standard errors
clustered_test_m3 <- coeftest(m3, vcov = vcovHC(m3, type = "HC1", cluster = "group"))
print(clustered_test_m3)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## POST1 -5.5476293 0.6195936 -8.9537 < 2.2e-16 ***
## POST_treat_11 -5.2336318 1.5831400 -3.3059 0.0009578 ***
## POST_treat_21 -1.1955516 0.7701380 -1.5524 0.1206732
## avg_temperature_monthly -0.1553601 0.0806890 -1.9254 0.0542696 .
## total_rain_monthly -0.0512628 0.0057434 -8.9254 < 2.2e-16 ***
## avg_wind_speed_monthly -4.9003670 0.3867853 -12.6695 < 2.2e-16 ***
## wind_direction_monthly -0.0802765 0.0246313 -3.2591 0.0011298 **
## avg_pressure_monthly -0.0228754 0.0429907 -0.5321 0.5946946
## month10 -7.8904847 0.9652652 -8.1744 4.305e-16 ***
## month11 -4.8495608 0.5236444 -9.2612 < 2.2e-16 ***
## month12 -2.1150944 0.2730168 -7.7471 1.269e-14 ***
## month2 -6.8103280 0.3999415 -17.0283 < 2.2e-16 ***
## month3 -11.9726520 0.6822960 -17.5476 < 2.2e-16 ***
## month4 -18.4029092 0.9073398 -20.2823 < 2.2e-16 ***
## month5 -21.2153214 1.1949235 -17.7545 < 2.2e-16 ***
## month6 -20.8785400 1.4974873 -13.9424 < 2.2e-16 ***
## month7 -19.5080971 1.7986847 -10.8458 < 2.2e-16 ***
## month8 -21.2159465 1.7673268 -12.0045 < 2.2e-16 ***
## month9 -13.7264403 1.3278991 -10.3370 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The adjusted R-squared is virtually unchanged. Our treatment effect for stations in Ring 1 is now around -5.2 (p-value = 0.001), while that for stations in Ring 2 is about -1.2 (p-value = 0.12). The Ring 2 coefficient is evidently smaller in absolute magnitude than the Ring 1 coefficient, and it is statistically insignificant. Hence, the reduction of NO2 in the air was much weaker in the stations farther away from Plaza del Carmen than in those in its immediate vicinity (or the centroid station itself). Moreover, the impact of the policy on pollution levels in Ring 1 stations is amplified once we hold the second treatment ring constant. See Section 6.1.D for an alternative specification without this ceteris paribus implication.
We keep the same variables in the model but now add the third ring of stations, those lying between 10 and 15 kilometres from the only station inside the LEZ.
# Set up the fixed effects model
m4 <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 +
            avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
            wind_direction_monthly + avg_pressure_monthly +
            month,
          model = "within", index = "station_id",
          data = did_df)
summary(m4)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = pollution_level ~ POST + POST_treat_1 + POST_treat_2 +
## POST_treat_3 + avg_temperature_monthly + total_rain_monthly +
## avg_wind_speed_monthly + wind_direction_monthly + avg_pressure_monthly +
## month, data = did_df, model = "within", index = "station_id")
##
## Balanced Panel: n = 40, T = 78, N = 3120
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -22.504041 -3.782079 -0.087779 3.754598 32.824383
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## POST1 -5.0031488 0.4431765 -11.2893 < 2.2e-16 ***
## POST_treat_11 -5.7844968 0.8246534 -7.0145 2.83e-12 ***
## POST_treat_21 -1.7447812 0.5981740 -2.9168 0.0035616 **
## POST_treat_31 -0.9474833 0.5768037 -1.6426 0.1005593
## avg_temperature_monthly -0.1578420 0.0723236 -2.1824 0.0291528 *
## total_rain_monthly -0.0512021 0.0044281 -11.5630 < 2.2e-16 ***
## avg_wind_speed_monthly -4.9039605 0.2088739 -23.4781 < 2.2e-16 ***
## wind_direction_monthly -0.0775976 0.0229203 -3.3855 0.0007193 ***
## avg_pressure_monthly -0.0229512 0.0193937 -1.1834 0.2367281
## month10 -7.8687657 0.8945216 -8.7966 < 2.2e-16 ***
## month11 -4.8408039 0.6200467 -7.8072 7.97e-15 ***
## month12 -2.1141712 0.5435174 -3.8898 0.0001025 ***
## month2 -6.8023153 0.5367147 -12.6740 < 2.2e-16 ***
## month3 -11.9587836 0.6282758 -19.0343 < 2.2e-16 ***
## month4 -18.3835960 0.7745375 -23.7349 < 2.2e-16 ***
## month5 -21.1822229 0.9999938 -21.1824 < 2.2e-16 ***
## month6 -20.8331133 1.2923393 -16.1205 < 2.2e-16 ***
## month7 -19.4469611 1.5910793 -12.2225 < 2.2e-16 ***
## month8 -21.1570753 1.4952603 -14.1494 < 2.2e-16 ***
## month9 -13.6853587 1.1954397 -11.4480 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 445050
## Residual Sum of Squares: 110270
## R-Squared: 0.75224
## Adj. R-Squared: 0.74746
## F-statistic: 464.53 on 20 and 3060 DF, p-value: < 2.22e-16
# T-test with clustered standard errors
clustered_test_m4 <- coeftest(m4, vcov = vcovHC(m4, type = "HC1", cluster = "group"))
print(clustered_test_m4)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## POST1 -5.0031488 0.9450035 -5.2943 1.279e-07 ***
## POST_treat_11 -5.7844968 1.7413320 -3.3219 0.0009046 ***
## POST_treat_21 -1.7447812 1.0579585 -1.6492 0.0992101 .
## POST_treat_31 -0.9474833 1.2405338 -0.7638 0.4450628
## avg_temperature_monthly -0.1578420 0.0806674 -1.9567 0.0504733 .
## total_rain_monthly -0.0512021 0.0058295 -8.7833 < 2.2e-16 ***
## avg_wind_speed_monthly -4.9039605 0.3847909 -12.7445 < 2.2e-16 ***
## wind_direction_monthly -0.0775976 0.0264252 -2.9365 0.0033441 **
## avg_pressure_monthly -0.0229512 0.0430273 -0.5334 0.5937879
## month10 -7.8687657 0.9646806 -8.1569 4.964e-16 ***
## month11 -4.8408039 0.5223767 -9.2669 < 2.2e-16 ***
## month12 -2.1141712 0.2741693 -7.7112 1.674e-14 ***
## month2 -6.8023153 0.3988911 -17.0531 < 2.2e-16 ***
## month3 -11.9587836 0.6799791 -17.5870 < 2.2e-16 ***
## month4 -18.3835960 0.9065672 -20.2782 < 2.2e-16 ***
## month5 -21.1822229 1.1917552 -17.7740 < 2.2e-16 ***
## month6 -20.8331133 1.4916507 -13.9665 < 2.2e-16 ***
## month7 -19.4469611 1.7930370 -10.8458 < 2.2e-16 ***
## month8 -21.1570753 1.7650425 -11.9867 < 2.2e-16 ***
## month9 -13.6853587 1.3269278 -10.3136 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, the R-squared has barely changed with the addition of new treated stations, which makes sense since it is the addition of weather controls, and not the incorporation of new units into the treatment, that enhances the explanatory power of the model. Our treatment effects for Rings 1, 2 and 3 are now -5.8 (p-value = 0.001), -1.7 (p-value = 0.1) and -0.9 (p-value = 0.4), respectively.
Only the effect of the policy for stations in Ring 1 is significant. The effects for Rings 2 and 3 are insignificant at the 0.05 significance level, with the p-value considerably larger for Ring 3 than for Ring 2. This makes sense, as Ring 3 stations are farther away from the restricted area than Ring 2 stations. These results show that, if there were any important spillover effects, they would be stronger and more likely to be statistically significant in stations closer to where the policy is actually active.
Our last model is the most complete one: it incorporates all controls and fixed effects, as well as all non-control stations, grouped into four treatment rings (the last comprising stations between 20 and 40 kilometres from Plaza del Carmen).
# Set up the fixed effects model
m5 <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
            avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
            wind_direction_monthly + avg_pressure_monthly +
            month,
          model = "within", index = c("station_id"),
          data = did_df)
summary(m5)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = pollution_level ~ POST + POST_treat_1 + POST_treat_2 +
## POST_treat_3 + POST_treat_4 + avg_temperature_monthly + total_rain_monthly +
## avg_wind_speed_monthly + wind_direction_monthly + avg_pressure_monthly +
## month, data = did_df, model = "within", index = c("station_id"))
##
## Balanced Panel: n = 40, T = 78, N = 3120
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -22.503967 -3.784694 -0.079589 3.752970 32.819618
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## POST1 -5.1657638 0.5693263 -9.0735 < 2.2e-16 ***
## POST_treat_11 -5.6205582 0.8999958 -6.2451 4.821e-10 ***
## POST_treat_21 -1.5817642 0.6972897 -2.2684 0.0233714 *
## POST_treat_31 -0.7830113 0.6807329 -1.1502 0.2501318
## POST_treat_41 0.4086759 0.8979933 0.4551 0.6490703
## avg_temperature_monthly -0.1580053 0.0723339 -2.1844 0.0290094 *
## total_rain_monthly -0.0512371 0.0044293 -11.5677 < 2.2e-16 ***
## avg_wind_speed_monthly -4.9034607 0.2089039 -23.4723 < 2.2e-16 ***
## wind_direction_monthly -0.0781354 0.0229537 -3.4040 0.0006725 ***
## avg_pressure_monthly -0.0230390 0.0193972 -1.1878 0.2350240
## month10 -7.8664352 0.8946522 -8.7927 < 2.2e-16 ***
## month11 -4.8397316 0.6201315 -7.8044 8.146e-15 ***
## month12 -2.1132348 0.5435918 -3.8875 0.0001034 ***
## month2 -6.8025758 0.5367846 -12.6728 < 2.2e-16 ***
## month3 -11.9585966 0.6283573 -19.0315 < 2.2e-16 ***
## month4 -18.3824035 0.7746423 -23.7302 < 2.2e-16 ***
## month5 -21.1811981 1.0001259 -21.1785 < 2.2e-16 ***
## month6 -20.8315224 1.2925115 -16.1171 < 2.2e-16 ***
## month7 -19.4455457 1.5912885 -12.2200 < 2.2e-16 ***
## month8 -21.1560681 1.4954557 -14.1469 < 2.2e-16 ***
## month9 -13.6836725 1.1956004 -11.4450 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 445050
## Residual Sum of Squares: 110260
## R-Squared: 0.75226
## Adj. R-Squared: 0.7474
## F-statistic: 442.305 on 21 and 3059 DF, p-value: < 2.22e-16
# T-test with clustered standard errors
clustered_test_m5 <- coeftest(m5, vcov = vcovHC(m5, type = "HC1", cluster = "group"))
print(clustered_test_m5)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## POST1 -5.1657638 1.1887487 -4.3455 1.435e-05 ***
## POST_treat_11 -5.6205582 1.8956858 -2.9649 0.003051 **
## POST_treat_21 -1.5817642 1.2960655 -1.2204 0.222394
## POST_treat_31 -0.7830113 1.4334775 -0.5462 0.584946
## POST_treat_41 0.4086759 1.9585440 0.2087 0.834725
## avg_temperature_monthly -0.1580053 0.0806897 -1.9582 0.050299 .
## total_rain_monthly -0.0512371 0.0058207 -8.8026 < 2.2e-16 ***
## avg_wind_speed_monthly -4.9034607 0.3849906 -12.7366 < 2.2e-16 ***
## wind_direction_monthly -0.0781354 0.0258846 -3.0186 0.002560 **
## avg_pressure_monthly -0.0230390 0.0430191 -0.5356 0.592306
## month10 -7.8664352 0.9636218 -8.1634 4.708e-16 ***
## month11 -4.8397316 0.5216035 -9.2786 < 2.2e-16 ***
## month12 -2.1132348 0.2742367 -7.7059 1.744e-14 ***
## month2 -6.8025758 0.3991540 -17.0425 < 2.2e-16 ***
## month3 -11.9585966 0.6806615 -17.5691 < 2.2e-16 ***
## month4 -18.3824035 0.9066941 -20.2741 < 2.2e-16 ***
## month5 -21.1811981 1.1924890 -17.7622 < 2.2e-16 ***
## month6 -20.8315224 1.4925041 -13.9574 < 2.2e-16 ***
## month7 -19.4455457 1.7941964 -10.8380 < 2.2e-16 ***
## month8 -21.1560681 1.7659380 -11.9801 < 2.2e-16 ***
## month9 -13.6836725 1.3267787 -10.3135 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This last model is as explanatory as the previous three, and the treatment effects for Rings 1, 2, 3 and 4 are around -5.6 (p-value = 0.003), -1.6 (p-value = 0.2), -0.8 (p-value = 0.6) and 0.4 (p-value = 0.8), respectively. Again, from Ring 2 onwards the treatment effects lose statistical significance, increasingly so as stations are located farther away from the centre. The coefficient and p-value for the Ring 4 treatment group are as expected: close to 0 (even positive!) and quite high, respectively.
We now delve a bit deeper into the interpretation of coefficients other than the interaction term in Model 5. Note that our model outputs do not show an intercept: the within model fits a separate intercept for each station. We extract these below:
# Extract fixed effects of the "within model" m5
fixef(m5)
## 123-2 14-2 148-4 161-1 45-2 47-2 49-3 5-2 58-4 6-4
## 90.277 79.740 84.392 79.891 77.656 91.546 99.463 88.942 88.331 86.426
## 65-14 7-4 74-7 79-11 79-16 79-17 79-18 79-24 79-27 79-35
## 93.049 85.901 92.868 93.983 88.033 98.605 87.304 73.877 93.845 95.345
## 79-36 79-38 79-39 79-4 79-40 79-47 79-48 79-49 79-50 79-54
## 89.713 93.075 90.320 95.650 88.984 87.650 88.722 82.362 91.761 88.414
## 79-55 79-56 79-57 79-58 79-59 79-60 79-8 80-3 9-1 92-5
## 98.810 104.255 83.921 68.957 82.969 84.124 105.413 80.783 73.142 82.692
Likewise, the coefficient for ‘POST’ (about -5.2 for Model 5) represents how much the average outcome of the control stations has changed in the post-treatment period. It is significant in all our models, implying pollution has decreased over time in the control stations. This confirms the negative trend observed when investigating parallel trends.
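This reading of POST as the control group's post-period change (and of the interaction as the DiD estimate) can be verified on a noise-free two-period example in base R (numbers purely illustrative):

```r
# Noise-free 2x2 example: the control group changes by -5 after the
# treatment date; treated units change by -5 plus a -3 treatment effect
df <- expand.grid(unit = 1:4, post = 0:1)
df$treat <- as.numeric(df$unit <= 2)  # units 1 and 2 are treated
df$y <- 50 + 2 * df$treat - 5 * df$post - 3 * df$treat * df$post

fit <- lm(y ~ treat + post + treat:post, data = df)
coef(fit)[c("post", "treat:post")]  # exactly -5 and -3
```

The 'post' coefficient recovers the common time shift experienced by the controls, while the interaction isolates the extra change in the treated units — the treatment effect.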
Visualization of treatment effects
We can plot the coefficient estimates for the interaction term (i.e. treatment effect) for each of the four rings and their respective confidence intervals. We extract the coefficient estimates and standard errors and build with them a dataframe, whose contents we plot below:
# Extract coefficient estimates and standard errors
coefficients <- coef(clustered_test_m5)
se <- sqrt(diag(vcovHC(m5, type = "HC1", cluster = "group")))
# Create a dataframe for plotting
df_plot <- data.frame(
  treatment_ring = c("Ring 1 (0 - 2.5km)", "Ring 2 (2.5 - 8km)", "Ring 3 (8 - 15km)", "Ring 4 (15 - 20km)"),
  coefficient = coefficients[grep("POST_treat_", names(coefficients))],
  SE = se[grep("POST_treat_", names(se))])
# Plot the coefficient estimates with confidence intervals
ggplot(df_plot, aes(x = treatment_ring, y = coefficient)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "blue") +
  geom_errorbar(aes(ymin = coefficient - 1.96 * SE,
                    ymax = coefficient + 1.96 * SE),
                width = 0.2) +
  labs(x = "Treatment Ring",
       y = "Estimate (95% Confidence Interval)",
       title = "Treatment Effect Estimates by Ring") +
  ylim(-10, 5) +
  theme_light()
Unsurprisingly, only the treatment effect for Ring 1 is significantly different from 0 (in this case, negative). The coefficient estimates get closer and closer to 0 as stations farther away from Plaza del Carmen are considered as “treated”, with that of the last ring being marginally positive.
Lastly, we export the results of our five models. To do that, we first create a vector of month dummies in order to exclude them from the regression table, both for ease of visualization and because they provide no relevant information about the significance of spillovers. We also set the names of the variables for better readability. Next, since we do not want the original model object (e.g. ‘m5’) but rather the model with the correct (clustered) standard errors (e.g. ‘clustered_test_m5’), we create our own function that calculates these for all five models (and also for those used to assess our most complete model’s robustness in the remainder of this document). This is done because inserting our previously created model objects with clustered standard errors directly into the stargazer() function returned an error once more than three models were included.
# Create vector of month dummies to be excluded from the regression table
exclude_vars <- paste0("month", 2:12)
# Assign variables for regression table
variable_labels <- c(
  "POST1" = "Post-Treatment",
  "POST_treat_11" = "Post-Treatment*Ring 1",
  "POST_treat_21" = "Post-Treatment*Ring 2",
  "POST_treat_31" = "Post-Treatment*Ring 3",
  "POST_treat_41" = "Post-Treatment*Ring 4",
  "avg_temperature_monthly" = "Average Monthly Temperature",
  "total_rain_monthly" = "Total Monthly Rain",
  "avg_wind_speed_monthly" = "Average Monthly Wind Speed",
  "wind_direction_monthly" = "Average Monthly Wind Direction",
  "avg_pressure_monthly" = "Average Monthly Pressure")
# Define function to calculate clustered standard errors
se_clustered <- function(x)
  coeftest(x, vcov = vcovHC(x, type = "HC1", cluster = "group"))[, "Std. Error"]
# Create the list of five models to include in the table
models <- list(m1, m2, m3, m4, m5)
# Generate regression table for 5 different models without the month dummies
stargazer(models,
          se = lapply(models, se_clustered),
          title = "Regression Results",
          align = TRUE,
          type = "text",
          dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
          dep.var.labels.include = FALSE,
          column.labels = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"),
          omit = exclude_vars,
          covariate.labels = variable_labels,
          omit.stat = c("f", "ser"))
##
## Regression Results
## ========================================================================================
## Dependent Variable: Average Monthly NO2 Concentration
## ---------------------------------------------------------
## Model 1 Model 2 Model 3 Model 4 Model 5
## (1) (2) (3) (4) (5)
## ----------------------------------------------------------------------------------------
## Post-Treatment -6.977*** -5.947*** -5.548*** -5.003*** -5.166***
## (0.440) (0.458) (0.620) (0.945) (1.189)
##
## Post-Treatment*Ring 1 -6.190*** -4.830*** -5.234*** -5.784*** -5.621***
## (1.533) (1.534) (1.583) (1.741) (1.896)
##
## Post-Treatment*Ring 2 -1.196 -1.745* -1.582
## (0.770) (1.058) (1.296)
##
## Post-Treatment*Ring 3 -0.947 -0.783
## (1.241) (1.433)
##
## Post-Treatment*Ring 4 0.409
## (1.959)
##
## Average Monthly Temperature -0.144* -0.155* -0.158* -0.158*
## (0.083) (0.081) (0.081) (0.081)
##
## Total Monthly Rain -0.051*** -0.051*** -0.051*** -0.051***
## (0.006) (0.006) (0.006) (0.006)
##
## Average Monthly Wind Speed -4.935*** -4.900*** -4.904*** -4.903***
## (0.386) (0.387) (0.385) (0.385)
##
## Average Monthly Wind Direction -0.075*** -0.080*** -0.078*** -0.078***
## (0.025) (0.025) (0.026) (0.026)
##
## Average Monthly Pressure -0.023 -0.023 -0.023 -0.023
## (0.043) (0.043) (0.043) (0.043)
##
## ----------------------------------------------------------------------------------------
## Observations 3,120 3,120 3,120 3,120 3,120
## R2 0.670 0.752 0.752 0.752 0.752
## Adjusted R2 0.664 0.747 0.747 0.747 0.747
## ========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The last step in our analysis consists of performing a series of robustness checks. We use these to test the validity and reliability of our results to different assumptions or specifications of the model.
We will start by introducing several changes to the model with the highest number of regressors (Model 5), consisting of exploring a different level of data aggregation and period fixed effects, using different specifications of the treatment groups in the model, employing a different model set-up and adding as well as modifying our controls.
We then run two placebo tests. We use these to test the validity of our causal inference strategy by examining the effect of the intervention on the outcome variable at a time where no effect is expected and by assessing the impact of this same treatment on a theoretically unrelated outcome.
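The logic of a placebo-in-time test can be sketched in base R: simulate a panel whose true effect begins only at the real treatment date, then estimate a fake treatment date inside the pre-period; the placebo interaction should come out near zero. A sketch on simulated data (all names and numbers hypothetical):

```r
set.seed(5)
# Simulated monthly panel: a -3 effect on treated units only after t = 40
df <- expand.grid(unit = 1:30, t = 1:60)
df$treated <- as.numeric(df$unit <= 10)
df$y <- 20 + rnorm(nrow(df)) - 3 * df$treated * (df$t > 40)

# Placebo in time: pretend the policy started at t = 20 and use only
# the genuinely untreated window (t <= 40)
pre <- subset(df, t <= 40)
pre$fake_post <- as.numeric(pre$t > 20)
placebo <- lm(y ~ treated * fake_post, data = pre)

# The placebo interaction should be statistically indistinguishable from zero
summary(placebo)$coefficients["treated:fake_post", ]
```

A significant placebo coefficient would instead signal diverging pre-trends and cast doubt on the parallel trends assumption.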
The first robustness check consists of testing whether aggregating data at the weekly (rather than monthly) level has any impact on the significance of the policy effect.
We repeat the steps followed in the first sections of this document, except for the difference in the aggregation level for both pollution and weather data.
Pollution data for pollution monitoring stations inside and outside the City of Madrid (2015 - 2022)
We first load, transform and merge datasets with pollution information from inside as well as outside the City of Madrid, and then combine this with information about the pollution monitoring stations. We remove observations from July 2021 onwards as well as those from Ring 6 stations.
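One subtlety of the weekly index built below with format(..., "%U"): base R's %U numbering starts weeks on Sunday and labels the days before the year's first Sunday as week "00", so week boundaries do not align with months and a year's first "week" can span only a few days. A quick base-R illustration:

```r
# %U weeks run Sunday-Saturday and the count starts at "00"
format(as.Date("2015-01-01"), "%U")  # "00" (1 Jan 2015 fell before the first Sunday)
format(as.Date("2015-01-04"), "%U")  # "01" (the first Sunday opens week 1)
# ISO-8601 numbering (%V) starts weeks on Monday and has no week 0
format(as.Date("2015-01-01"), "%V")  # "01"
```

Since we additionally group by year and month, this only affects how daily observations are bucketed within a month, not the overall span of the panel.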
# Aggregate in-City data to weekly level
city_weekly <- all_city_data_clean %>%
  mutate(week = as.numeric(format(as.Date(paste(year, month, day, sep = "-")), "%U"))) %>%
  group_by(town_code, station_code, year, month, week) %>%
  summarise(pollution_level = mean(pollution_level))
# Aggregate out-of-City data to weekly level
com_weekly <- all_com_data_clean %>%
  mutate(week = as.numeric(format(as.Date(paste(year, month, day, sep = "-")), "%U"))) %>%
  group_by(town_code, station_code, year, month, week) %>%
  summarise(pollution_level = mean(pollution_level))
# Bind together observations of both datasets
full_data_w <- bind_rows(city_weekly, com_weekly)
# Join 'city_weekly' with 'city_stations' by 'town_code' and 'station_code'
city_data_w <- city_weekly %>%
  left_join(city_stations_clean, by = c("town_code", "station_code"))
# Join 'com_weekly' with 'com_stations' by 'town_code' and 'station_code'
com_data_w <- com_weekly %>%
  left_join(com_stations_clean, by = c("town_code", "station_code"))
# Combine both dataframes into a single one for the whole Community
full_data_complete_w <- bind_rows(city_data_w, com_data_w) %>%
  select(-c(station_name, rural_area_type))
# Replace the missing values in 'town_name' by Madrid
full_data_complete_w$town_name <- ifelse(is.na(full_data_complete_w$town_name),
                                         "Madrid", full_data_complete_w$town_name)
# Merge data with weekly pollution averages and station data (including rings and nearby weather stations)
model_pollution_df_w <- full_data_complete_w %>%
  left_join(data_all_stations, by = c("town_code", "station_code", "address", "station_type",
                                      "longitude", "latitude", "distance_km")) %>%
  ungroup()
# Remove data from July 2021 onwards
model_pollution_df_w <- model_pollution_df_w %>%
  filter(year < 2021 | (year == 2021 & month < 7)) %>%
  # Remove stations in "Ring 6: Excluded"
  filter(ring != "Ring 6: Excluded")
# Remove any levels not present in the dataframe
model_pollution_df_w$ring <- droplevels(model_pollution_df_w$ring)
Weather data (2015 - 2021)
Concerning weather data, we only need to modify the last step of Section 4 (after imputation), changing the level of aggregation.
# Group imputed weather variables and compute weekly averages
# (the '_monthly' suffix is kept so the model formula can be reused unchanged)
weather_full_final_w <- weather_full_final %>%
  mutate(week = as.numeric(format(as.Date(paste(year, month, day, sep = "-")), "%U"))) %>%
  group_by(weather_station_name, year, month, week) %>%
  summarise(avg_temperature_monthly = round(mean(avg_temperature), 2),
            min_temperature_monthly = round(min(min_temperature), 2),
            max_temperature_monthly = round(max(max_temperature), 2),
            total_rain_monthly = round(sum(total_rain), 2),
            wind_direction_monthly = round(mean(direction_max_wind_speed), 2),
            avg_wind_speed_monthly = round(mean(avg_wind_speed), 2),
            avg_pressure_monthly = round(mean((min_pressure + max_pressure)/2), 2))
Pre-modelling feature engineering
As for the pre-modelling stage, we merge our datasets with pollution and weather information, after which we create the post-treatment, treatment and interaction variables. We transform these into factors and ‘month’ into a character variable. We create the unique identifier for each station and remove irrelevant variables.
# Rename weather station variable so that it is the same in both datasets
model_pollution_df_w <- model_pollution_df_w %>%
  rename(weather_station_name = closest_weather_station)
# Merge pollution and weather data
did_df_w <- model_pollution_df_w %>%
  left_join(weather_full_final_w,
            by = c("year", "month", "week", "weather_station_name"))
# Create post-treatment, treatment and interaction terms
did_df_w <- did_df_w %>% mutate(
  # Create post-treatment dummy (implementation in late November 2018, so POST starts in December 2018)
  POST = ifelse(year > 2018 | (year == 2018 & month == 12), 1, 0),
  # Create treatment dummy for all 4 rings
  treatment_ring1 = ifelse(ring == "Ring 1: Treated", 1, 0),
  treatment_ring2 = ifelse(ring == "Ring 2: Treated", 1, 0),
  treatment_ring3 = ifelse(ring == "Ring 3: Treated", 1, 0),
  treatment_ring4 = ifelse(ring == "Ring 4: Treated", 1, 0),
  # Create interaction terms by multiplying POST and treatment ring
  POST_treat_1 = POST*as.numeric(treatment_ring1),
  POST_treat_2 = POST*as.numeric(treatment_ring2),
  POST_treat_3 = POST*as.numeric(treatment_ring3),
  POST_treat_4 = POST*as.numeric(treatment_ring4))
# Transform new variables into factors
did_df_w <- did_df_w %>%
  mutate_if(grepl("treatment_ring|POST_treat|POST", names(.)), ~as.factor(.))
# Convert 'month' to a character variable
did_df_w$month <- as.character(did_df_w$month)
# Create 'station_id' by concatenating 'town_code' and 'station_code', separated by a hyphen
did_df_w$station_id <- as.factor(paste(did_df_w$town_code, did_df_w$station_code, sep = "-"))
# Remove unnecessary variables
did_df_w <- did_df_w %>%
  select(-c(min_temperature_monthly, max_temperature_monthly, town_code, station_code))
Modelling
Finally we run Model 5 on weekly data. We include all four treatment rings and all five weather covariates.
We still include month fixed effects, as these should suffice to capture time-dependent variability; adding week fixed effects would greatly increase the number of parameters in the model without providing much additional control or explanatory power.
We compute clustered standard errors and display results directly in the regression table.
# Set up the fixed effects model
m5_w <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
wind_direction_monthly + avg_pressure_monthly +
month,
model = "within", index = c("station_id"),
data = did_df_w)
# T-test with clustered standard errors
clustered_test_m5_w <- coeftest(m5_w, vcov = vcovHC(m5_w, type = "HC1", cluster = "group"))
print(clustered_test_m5_w)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## POST1 -4.9767423 1.1820969 -4.2101 2.566e-05 ***
## POST_treat_11 -5.7751098 1.9291483 -2.9936 0.002761 **
## POST_treat_21 -1.6400491 1.2988664 -1.2627 0.206723
## POST_treat_31 -1.1019324 1.4237000 -0.7740 0.438947
## POST_treat_41 0.1326552 2.0461106 0.0648 0.948308
## avg_temperature_monthly -0.1409170 0.0500996 -2.8127 0.004918 **
## total_rain_monthly -0.1090991 0.0093691 -11.6445 < 2.2e-16 ***
## avg_wind_speed_monthly -5.1668277 0.3222035 -16.0359 < 2.2e-16 ***
## wind_direction_monthly 0.0097103 0.0135465 0.7168 0.473499
## avg_pressure_monthly 0.0669665 0.0240300 2.7868 0.005330 **
## month10 -6.6578146 0.5727633 -11.6240 < 2.2e-16 ***
## month11 -4.0874453 0.3413810 -11.9733 < 2.2e-16 ***
## month12 -1.2063453 0.2147723 -5.6169 1.977e-08 ***
## month2 -6.1495431 0.4514727 -13.6211 < 2.2e-16 ***
## month3 -10.5552064 0.5738735 -18.3929 < 2.2e-16 ***
## month4 -18.0118064 0.7102636 -25.3593 < 2.2e-16 ***
## month5 -20.6660071 0.9038294 -22.8649 < 2.2e-16 ***
## month6 -19.0568242 1.0642778 -17.9059 < 2.2e-16 ***
## month7 -18.5840295 1.2311677 -15.0946 < 2.2e-16 ***
## month8 -19.8019059 1.2263895 -16.1465 < 2.2e-16 ***
## month9 -12.6234460 0.8631197 -14.6254 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
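The helper se_clustered() used in the export step below is defined earlier in the report; a minimal sketch consistent with the vcovHC() call applied in the coeftest above would be:

```r
# Sketch of the se_clustered() helper used below (the actual definition
# appears earlier in the report): extract the standard errors from a
# cluster-robust covariance matrix of a plm model, mirroring the
# HC1/group-clustered estimator used throughout this section.
se_clustered <- function(model) {
  sqrt(diag(plm::vcovHC(model, type = "HC1", cluster = "group")))
}
```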
# Calculate clustered standard errors to export results using `stargazer()`
se_m5_w <- se_clustered(m5_w)
# Export results
stargazer(m5_w, se = list(se_m5_w),
title = "Regression Results",
align = TRUE,
type = "text",
dep.var.caption = "Dependent Variable: Average Weekly NO2 Concentration",
dep.var.labels.include = FALSE,
column.labels = c("Model 5 (Weekly Aggregation)"),
omit = exclude_vars,
covariate.labels = variable_labels)
##
## Regression Results
## ===================================================================================
## Dependent Variable: Average Weekly NO2 Concentration
## ----------------------------------------------------
## Model 5 (Weekly Aggregation)
## -----------------------------------------------------------------------------------
## Post-Treatment -4.977***
## (1.182)
##
## Post-Treatment*Ring 1 -5.775***
## (1.929)
##
## Post-Treatment*Ring 2 -1.640
## (1.299)
##
## Post-Treatment*Ring 3 -1.102
## (1.424)
##
## Post-Treatment*Ring 4 0.133
## (2.046)
##
## Average Monthly Temperature -0.141***
## (0.050)
##
## Total Monthly Rain -0.109***
## (0.009)
##
## Average Monthly Wind Speed -5.167***
## (0.322)
##
## Average Monthly Wind Direction 0.010
## (0.014)
##
## Average Monthly Pressure 0.067***
## (0.024)
##
## -----------------------------------------------------------------------------------
## Observations 16,138
## R2 0.572
## Adjusted R2 0.570
## F Statistic 1,021.310*** (df = 21; 16077)
## ===================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We now have 16,138 observations, compared to the 3,120 we had when aggregating by month. Monthly aggregation could in principle mask important variability or information, for instance by amplifying measurement error. However, this does not seem to be the case, as we obtain essentially the same results at both aggregation levels: only average pollution at stations in the first treatment ring appears to be significantly affected by the implementation of Madrid Central (the estimated effect is now around -5.8 µg/m3, compared to -5.6 µg/m3 in the model with monthly observations).
Therefore, we can state that the results found in Section 5 are robust to a different level of aggregation.
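As a quick sanity check on the sample sizes quoted above, the number of observations used in each fit can be compared directly. This is a sketch that assumes the monthly Model 5 from Section 5 is still in the workspace under the name m5:

```r
# Compare the number of observations used in each fit ('m5' is assumed to be
# the monthly Model 5 object from Section 5)
nobs(m5_w)  # weekly aggregation
nobs(m5)    # monthly aggregation
```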
Now, rather than testing the robustness of our results in the strict sense, we want to show that changing the model setup (while still using the plm package) does not change the results significantly.
While we still set the ‘model’ argument to “within” to specify the fixed effects estimator, adding ‘effect = “twoways”’ indicates explicitly that we want to estimate both individual and time fixed effects. We hence index not only by “station_id” but also by “month”, so that each observation is nested within the unique station and month it belongs to.
# Set up the fixed effects model
m5_two <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
wind_direction_monthly + avg_pressure_monthly,
model = "within",
effect = "twoways",
index = c("station_id", "month"),
data = did_df)
# T-test with clustered standard errors
clustered_test_m5_two <- coeftest(m5_two, vcov = vcovHC(m5_two, type = "HC1", cluster = "group"))
print(clustered_test_m5_two)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## POST1 -5.1657638 1.1866445 -4.3533 1.385e-05 ***
## POST_treat_11 -5.6205582 1.8923304 -2.9702 0.002999 **
## POST_treat_21 -1.5817642 1.2937714 -1.2226 0.221575
## POST_treat_31 -0.7830113 1.4309402 -0.5472 0.584281
## POST_treat_41 0.4086759 1.9550772 0.2090 0.834436
## avg_temperature_monthly -0.1580053 0.0805469 -1.9617 0.049893 *
## total_rain_monthly -0.0512371 0.0058104 -8.8182 < 2.2e-16 ***
## avg_wind_speed_monthly -4.9034607 0.3843092 -12.7592 < 2.2e-16 ***
## wind_direction_monthly -0.0781354 0.0258388 -3.0240 0.002516 **
## avg_pressure_monthly -0.0230390 0.0429430 -0.5365 0.591650
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The estimates (and significance levels) for each of the four treatment rings are essentially the same as before, but we can no longer inspect the ceteris paribus effect of each month, as the month fixed effects are no longer included explicitly in the model.
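Although the month effects are absorbed rather than reported, they are still estimated and can be recovered from the fitted twoways model with plm's fixef() extractor:

```r
# Recover the absorbed time (month) fixed effects from the twoways model
fixef(m5_two, effect = "time")
```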
We export our results:
# Calculate clustered standard errors to export results using `stargazer()`
se_m5_two <- se_clustered(m5_two)
# Export results
stargazer(m5_two,
se = list(se_m5_two),
title = "Regression Results",
align = TRUE,
type = "text",
dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
dep.var.labels.include = FALSE,
column.labels = c("Model 5 ('Two-Way' Settings)"),
covariate.labels = variable_labels)
##
## Regression Results
## ====================================================================================
## Dependent Variable: Average Monthly NO2 Concentration
## -----------------------------------------------------
## Model 5 ('Two-Way' Settings)
## ------------------------------------------------------------------------------------
## Post-Treatment -5.166***
## (1.187)
##
## Post-Treatment*Ring 1 -5.621***
## (1.892)
##
## Post-Treatment*Ring 2 -1.582
## (1.294)
##
## Post-Treatment*Ring 3 -0.783
## (1.431)
##
## Post-Treatment*Ring 4 0.409
## (1.955)
##
## Average Monthly Temperature -0.158**
## (0.081)
##
## Total Monthly Rain -0.051***
## (0.006)
##
## Average Monthly Wind Speed -4.903***
## (0.384)
##
## Average Monthly Wind Direction -0.078***
## (0.026)
##
## Average Monthly Pressure -0.023
## (0.043)
##
## ------------------------------------------------------------------------------------
## Observations 3,120
## R2 0.426
## Adjusted R2 0.415
## F Statistic 226.961*** (df = 10; 3059)
## ====================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Still, there is an important change compared to the baseline model: we now obtain a considerably lower adjusted R-squared (it has dropped to around 0.42). Hence, this model yields the same results as the initial one, albeit with a lower degree of explanatory power.
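The drop in fit can be read off the model summaries directly. This sketch assumes the baseline monthly model from Section 5 is still in memory under the name m5; note that plm reports the within R-squared for “within” models:

```r
# Within R-squared and adjusted R-squared for each specification
summary(m5_two)$r.squared  # twoways specification
summary(m5)$r.squared      # baseline Model 5 ('m5' assumed from Section 5)
```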
Next, we want to ensure that our results are not dependent on the specification of our models by treatment ring. Recall that in Model 5 of Section 5 we created four mutually exclusive treatment variables (and thus interaction terms) so that each would contain a different set of stations in order to test the geographical scope of policy spillovers.
Now we take a slightly different approach and create cumulative treatment variables, where each treatment ring contains the current ring as well as all inner rings. For instance, the third treatment ring includes observations in Ring 3, Ring 2 and Ring 1.
To do that, we repeat the transformations used before, but create our treatment rings following this alternative specification.
# Merge pollution and weather data
did_df_s <- model_pollution_df %>%
left_join(weather_full_final_a,
by = c("year", "month", "weather_station_name"))
# Create post-treatment dummy (POST = 1 if year is greater than 2018 or if equal to 2018 and the month is December)
did_df_s <- did_df_s %>%
mutate(POST = ifelse(year > 2018 | (year == 2018 & month == 12), 1, 0))
# Create treatment dummies for each ring with alternative specification
did_df_s <- did_df_s %>%
mutate(treatment_ring1 = ifelse(ring == "Ring 1: Treated", 1, 0),
treatment_ring2 = ifelse(ring == "Ring 2: Treated" | ring == "Ring 1: Treated", 1, 0),
treatment_ring3 = ifelse(ring == "Ring 3: Treated" | ring == "Ring 2: Treated" | ring == "Ring 1: Treated", 1, 0),
treatment_ring4 = ifelse(ring == "Ring 4: Treated" | ring == "Ring 3: Treated" | ring == "Ring 2: Treated" | ring == "Ring 1: Treated", 1, 0))
# Create interaction terms
did_df_s <- did_df_s %>%
mutate(POST_treat_1 = POST*as.numeric(treatment_ring1),
POST_treat_2 = POST*as.numeric(treatment_ring2),
POST_treat_3 = POST*as.numeric(treatment_ring3),
POST_treat_4 = POST*as.numeric(treatment_ring4))
# Transform new variables into factors
did_df_s <- did_df_s %>%
mutate_if(grepl("treatment_ring|POST_treat|POST", names(.)), ~as.factor(.))
# Create 'station_id' by concatenating 'town_code' and 'station_code', separated by a hyphen
did_df_s$station_id <- as.factor(paste(did_df_s$town_code, did_df_s$station_code, sep="-"))
# Remove variables
did_df_s <- did_df_s %>% select(-c(min_temperature_monthly, max_temperature_monthly, town_code, station_code))
# Convert month variables to strings
did_df_s$month <- as.character(did_df_s$month)
We now run our most complete model:
# Set up the fixed effects model
m5_s <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
wind_direction_monthly + avg_pressure_monthly +
month,
model = "within", index = c("station_id"),
data = did_df_s)
# Calculate clustered standard errors to export results using `stargazer()`
se_m5_s <- se_clustered(m5_s)
# Export results
stargazer(m5_s, se = list(se_m5_s),
title = "Regression Results",
align = TRUE,
type = "text",
dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
dep.var.labels.include = FALSE,
column.labels = c("Model 5 (Alternative Treatment Grouping)"),
omit = exclude_vars,
covariate.labels = variable_labels)
##
## Regression Results
## ====================================================================================
## Dependent Variable: Average Monthly NO2 Concentration
## -----------------------------------------------------
## Model 5 (Alternative Treatment Grouping)
## ------------------------------------------------------------------------------------
## Post-Treatment -5.166***
## (1.189)
##
## Post-Treatment*Ring 1 -4.039***
## (1.550)
##
## Post-Treatment*Ring 2 -0.799
## (0.916)
##
## Post-Treatment*Ring 3 -1.192
## (1.749)
##
## Post-Treatment*Ring 4 0.409
## (1.959)
##
## Average Monthly Temperature -0.158*
## (0.081)
##
## Total Monthly Rain -0.051***
## (0.006)
##
## Average Monthly Wind Speed -4.903***
## (0.385)
##
## Average Monthly Wind Direction -0.078***
## (0.026)
##
## Average Monthly Pressure -0.023
## (0.043)
##
## ------------------------------------------------------------------------------------
## Observations 3,120
## R2 0.752
## Adjusted R2 0.747
## F Statistic 442.305*** (df = 21; 3059)
## ====================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Our estimates do not seem very different from those in the original model, although (for instance) the treatment effect for Ring 1 stations is smaller in absolute magnitude under this new specification.
We can repeat our plot of the confidence intervals for each (new) treatment ring to assess more easily whether the new specification leads to markedly different results:
# Extract coefficient estimates and clustered standard errors (HC1, clustered by station, as elsewhere in this section)
coefficients <- coef(m5_s)
se <- sqrt(diag(vcovHC(m5_s, type = "HC1", cluster = "group")))
# Create a dataframe for plotting
df_plot <- data.frame(
treatment_ring = c("Ring 1 (0 - 2.5km)", "Ring 2 (0 - 8km)", "Ring 3 (0 - 15km)", "Ring 4 (0 - 20km)"),
coefficient = coefficients[grep("POST_treat_", names(coefficients))],
SE = se[grep("POST_treat_", names(se))])
# Plot the coefficient estimates with confidence intervals
ggplot(df_plot, aes(x = treatment_ring, y = coefficient)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "blue") +
geom_errorbar(aes(ymin = coefficient - 1.96 * SE,
ymax = coefficient + 1.96 * SE),
width = 0.2) +
labs(x = "Treatment Ring",
y = "Estimate (95% Confidence Interval)",
title = "Treatment Effect Estimates by Ring (Alternative Specification)") +
ylim(-10, 5) +
theme_light()
Indeed, the estimates are very similar. The main differences are the lower variability for Ring 2 and the fact that the estimated coefficient for Ring 3 is more negative (i.e. larger in absolute value) than that for Ring 2. Despite the lack of statistical significance for both, this result raises the question of why including stations farther from the city centre in the treatment group should strengthen the estimated treatment effect. It could also suggest that our original specification yields more accurate results than this alternative.
Models 3 to 5 capture the treatment effect for each of the four rings while controlling for the other rings included in the model. That is, the estimated effectiveness of the policy for the stations in each ring depends on the inclusion of the other ring(s).
Yet, just as we did for Model 2 and Ring 1 stations, we can run separate models in which only one ring constitutes the treatment group (alongside time and unit fixed effects and the weather controls).
We reproduce Model 2 for comparability and we do the same with Rings 2, 3 and 4:
# Set up model for only Ring 1 stations (same as Model 2, or 'm2' in Section 5)
m2_r1 <- plm(pollution_level ~ POST + POST_treat_1 +
avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
wind_direction_monthly + avg_pressure_monthly +
month,
model = "within", index = "station_id",
data = did_df)
clustered_test_m2_r1 <- coeftest(m2_r1, vcov = vcovHC(m2_r1, type = "HC1", cluster = "group"))
print(clustered_test_m2_r1)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## POST1 -5.9473473 0.4577064 -12.9938 < 2.2e-16 ***
## POST_treat_11 -4.8299489 1.5337223 -3.1492 0.001653 **
## avg_temperature_monthly -0.1435042 0.0829049 -1.7310 0.083561 .
## total_rain_monthly -0.0513408 0.0056766 -9.0443 < 2.2e-16 ***
## avg_wind_speed_monthly -4.9349604 0.3855256 -12.8006 < 2.2e-16 ***
## wind_direction_monthly -0.0750806 0.0246257 -3.0489 0.002317 **
## avg_pressure_monthly -0.0230897 0.0426763 -0.5410 0.588518
## month10 -8.0155215 0.9742748 -8.2272 2.802e-16 ***
## month11 -4.9007961 0.5286663 -9.2701 < 2.2e-16 ***
## month12 -2.1510398 0.2750942 -7.8193 7.251e-15 ***
## month2 -6.8213455 0.4013852 -16.9945 < 2.2e-16 ***
## month3 -11.9980392 0.6871073 -17.4617 < 2.2e-16 ***
## month4 -18.4655908 0.9147885 -20.1856 < 2.2e-16 ***
## month5 -21.3357585 1.2068775 -17.6785 < 2.2e-16 ***
## month6 -21.0576208 1.5158347 -13.8918 < 2.2e-16 ***
## month7 -19.7359705 1.8265641 -10.8050 < 2.2e-16 ***
## month8 -21.4271853 1.7942757 -11.9420 < 2.2e-16 ***
## month9 -13.9027945 1.3480227 -10.3135 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Set up model for only Ring 2 stations
m2_r2 <- plm(pollution_level ~ POST + POST_treat_2 +
avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
wind_direction_monthly + avg_pressure_monthly +
month,
model = "within", index = "station_id",
data = did_df)
clustered_test_m2_r2 <- coeftest(m2_r2, vcov = vcovHC(m2_r2, type = "HC1", cluster = "group"))
print(clustered_test_m2_r2)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## POST1 -6.2897098 0.6463694 -9.7308 < 2.2e-16 ***
## POST_treat_21 -0.4295935 0.8258850 -0.5202 0.6029887
## avg_temperature_monthly -0.1627580 0.0813332 -2.0011 0.0454670 *
## total_rain_monthly -0.0521784 0.0057062 -9.1441 < 2.2e-16 ***
## avg_wind_speed_monthly -5.0202094 0.3971890 -12.6393 < 2.2e-16 ***
## wind_direction_monthly -0.0768988 0.0233469 -3.2937 0.0009999 ***
## avg_pressure_monthly -0.0294968 0.0430448 -0.6853 0.4932330
## month10 -7.8441129 0.9773790 -8.0257 1.425e-15 ***
## month11 -4.8329441 0.5264660 -9.1800 < 2.2e-16 ***
## month12 -2.1404266 0.2777896 -7.7052 1.752e-14 ***
## month2 -6.7627374 0.4038861 -16.7442 < 2.2e-16 ***
## month3 -11.8720350 0.6914816 -17.1690 < 2.2e-16 ***
## month4 -18.2963620 0.9184871 -19.9201 < 2.2e-16 ***
## month5 -21.0898650 1.2127485 -17.3901 < 2.2e-16 ***
## month6 -20.7190665 1.5242982 -13.5925 < 2.2e-16 ***
## month7 -19.3200218 1.8308996 -10.5522 < 2.2e-16 ***
## month8 -21.0509578 1.7872776 -11.7782 < 2.2e-16 ***
## month9 -13.6196706 1.3447034 -10.1284 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Set up model for only Ring 3 stations
m2_r3 <- plm(pollution_level ~ POST + POST_treat_3 +
avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
wind_direction_monthly + avg_pressure_monthly +
month,
model = "within", index = "station_id",
data = did_df)
clustered_test_m2_r3 <- coeftest(m2_r3, vcov = vcovHC(m2_r3, type = "HC1", cluster = "group"))
print(clustered_test_m2_r3)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## POST1 -6.671844 0.578949 -11.5241 < 2.2e-16 ***
## POST_treat_31 0.736277 0.998027 0.7377 0.460733
## avg_temperature_monthly -0.159509 0.081911 -1.9473 0.051585 .
## total_rain_monthly -0.052151 0.005633 -9.2582 < 2.2e-16 ***
## avg_wind_speed_monthly -5.009075 0.390329 -12.8329 < 2.2e-16 ***
## wind_direction_monthly -0.078973 0.025817 -3.0589 0.002241 **
## avg_pressure_monthly -0.028879 0.042978 -0.6720 0.501661
## month10 -7.872190 0.984802 -7.9937 1.838e-15 ***
## month11 -4.844132 0.529705 -9.1450 < 2.2e-16 ***
## month12 -2.141030 0.280721 -7.6269 3.189e-14 ***
## month2 -6.773702 0.404481 -16.7467 < 2.2e-16 ***
## month3 -11.892950 0.694112 -17.1340 < 2.2e-16 ***
## month4 -18.324168 0.918117 -19.9584 < 2.2e-16 ***
## month5 -21.133339 1.212587 -17.4283 < 2.2e-16 ***
## month6 -20.778431 1.527542 -13.6025 < 2.2e-16 ***
## month7 -19.396870 1.837156 -10.5581 < 2.2e-16 ***
## month8 -21.123098 1.785349 -11.8314 < 2.2e-16 ***
## month9 -13.670957 1.348851 -10.1353 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Set up model for only Ring 4 stations
m2_r4 <- plm(pollution_level ~ POST + POST_treat_4 +
avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
wind_direction_monthly + avg_pressure_monthly +
month,
model = "within", index = "station_id",
data = did_df)
clustered_test_m2_r4 <- coeftest(m2_r4, vcov = vcovHC(m2_r4, type = "HC1", cluster = "group"))
print(clustered_test_m2_r4)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## POST1 -6.6054301 0.4871396 -13.5596 < 2.2e-16 ***
## POST_treat_41 1.8551707 1.6330799 1.1360 0.256048
## avg_temperature_monthly -0.1631823 0.0819349 -1.9916 0.046503 *
## total_rain_monthly -0.0522365 0.0056598 -9.2294 < 2.2e-16 ***
## avg_wind_speed_monthly -5.0174129 0.3921263 -12.7954 < 2.2e-16 ***
## wind_direction_monthly -0.0767009 0.0235764 -3.2533 0.001153 **
## avg_pressure_monthly -0.0295190 0.0429969 -0.6865 0.492426
## month10 -7.8381524 0.9790951 -8.0055 1.673e-15 ***
## month11 -4.8299127 0.5256390 -9.1887 < 2.2e-16 ***
## month12 -2.1389652 0.2764164 -7.7382 1.359e-14 ***
## month2 -6.7625943 0.4045067 -16.7181 < 2.2e-16 ***
## month3 -11.8711128 0.6952570 -17.0744 < 2.2e-16 ***
## month4 -18.2925094 0.9211973 -19.8573 < 2.2e-16 ***
## month5 -21.0857718 1.2155650 -17.3465 < 2.2e-16 ***
## month6 -20.7137488 1.5276395 -13.5593 < 2.2e-16 ***
## month7 -19.3123452 1.8377630 -10.5086 < 2.2e-16 ***
## month8 -21.0428324 1.7883474 -11.7666 < 2.2e-16 ***
## month9 -13.6132188 1.3493920 -10.0884 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the net effects of Madrid Central on average air pollution levels at Ring 1, Ring 2, Ring 3 and Ring 4 stations are -4.83, -0.43, 0.74 and 1.85 µg/m3, respectively.
It is useful to compare these coefficients with the estimates we obtained for each ring in Model 5 (i.e. the model including all four treatment rings): -5.6, -1.6, -0.8 and 0.4, respectively. Just as in the complete model, the pollution reduction due to the LEZ is significant for Ring 1 but not for the remaining rings. However, we now see that for these first two rings, the coefficients are larger (in absolute value) in the model that controls for all rings than in the models estimating the effect for each ring individually. This could reflect (negative) spillover effects from the LEZ beyond its immediate proximity. In fact, without clustered standard errors, the policy effect for Ring 2 stations was significant (and negative) in Models 3, 4 and 5 of our primary regression exercise. It is more likely, however, that the fluctuations in coefficient magnitudes are due to the small number of treated stations in each individual model.
For Ring 3 stations, by contrast, the coefficient in the “net effect model” is positive, whereas it was negative (though still insignificant) in the model with all rings. This suggests that negative spillover effects from the LEZ might be counteracting positive ones observed when analyzing this group of stations independently. Confounding factors, spatial interactions or (again) simply the composition of the treatment group could be behind this change of sign between the two models.
Finally, the coefficient for Ring 4 stations considered in isolation is higher than that for the same stations in the complete model, with both being positive and insignificant. This is the opposite of the pattern observed for Ring 1 and Ring 2 stations. While this could indicate that the LEZ’s impact on pollution levels is relatively localized, the very low statistical significance of the Ring 4 estimates makes the composition of the treatment group the most likely driver of this variation.
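To make the comparison above concrete, the ring-specific DiD estimates from the individual models can be collected next to those from the complete Model 5. This is a sketch assuming the Section 5 model object is named m5; the coefficient names carry a trailing level “1” because the interaction terms are factors:

```r
# Side-by-side comparison of ring-specific DiD estimates
ring_terms <- paste0("POST_treat_", 1:4, "1")
individual_models <- list(m2_r1, m2_r2, m2_r3, m2_r4)
comparison <- data.frame(
  ring = paste("Ring", 1:4),
  individual = sapply(seq_along(individual_models),
                      function(i) coef(individual_models[[i]])[ring_terms[i]]),
  complete = coef(m5)[ring_terms]  # 'm5' assumed from Section 5
)
print(comparison, digits = 3)
```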
To close off this section, in order to ensure that the significant effect of the policy on the pollution levels of Ring 1 stations found in both sets of models (“complete” and “individual”) does not exclusively hinge on the impact on the only station inside the LEZ (Plaza del Carmen), we re-run the model on the first treatment ring excluding Plaza del Carmen:
# Merge pollution and weather data
did_df_i <- model_pollution_df %>%
left_join(weather_full_final_a,
by = c("year", "month", "weather_station_name"))
# Create post-treatment dummy
did_df_i <- did_df_i %>%
mutate(POST = ifelse(year > 2018 | (year == 2018 & month == 12), 1, 0))
# Create treatment dummies for each ring, excluding Plaza del Carmen from the first ("station_id = 79-35")
did_df_i <- did_df_i %>%
mutate(treatment_ring1 = ifelse(ring == "Ring 1: Treated" & !(town_code == 79 & station_code == 35),
1, 0),
treatment_ring2 = ifelse(ring == "Ring 2: Treated", 1, 0),
treatment_ring3 = ifelse(ring == "Ring 3: Treated", 1, 0),
treatment_ring4 = ifelse(ring == "Ring 4: Treated", 1, 0))
# Create interaction terms
did_df_i <- did_df_i %>%
mutate(POST_treat_1 = POST*as.numeric(treatment_ring1),
POST_treat_2 = POST*as.numeric(treatment_ring2),
POST_treat_3 = POST*as.numeric(treatment_ring3),
POST_treat_4 = POST*as.numeric(treatment_ring4))
# Transform new variables into factors
did_df_i <- did_df_i %>%
mutate_if(grepl("treatment_ring|POST_treat|POST", names(.)), ~as.factor(.))
# Create 'station_id' by concatenating 'town_code' and 'station_code', separated by a hyphen
did_df_i$station_id <- as.factor(paste(did_df_i$town_code, did_df_i$station_code, sep="-"))
# Remove variables
did_df_i <- did_df_i %>% select(-c(min_temperature_monthly, max_temperature_monthly, town_code, station_code))
# Convert month variables to strings
did_df_i$month <- as.character(did_df_i$month)
# Set up model for only Ring 1 stations, excluding Plaza del Carmen
m2_r1_pc <- plm(pollution_level ~ POST + POST_treat_1 +
avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
wind_direction_monthly + avg_pressure_monthly +
month,
model = "within", index = "station_id",
data = did_df_i)
clustered_test_m2_r1_pc <- coeftest(m2_r1_pc, vcov = vcovHC(m2_r1_pc, type = "HC1", cluster = "group"))
print(clustered_test_m2_r1_pc)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## POST1 -6.1168900 0.4695665 -13.0267 < 2.2e-16 ***
## POST_treat_11 -4.1258309 1.9179302 -2.1512 0.031539 *
## avg_temperature_monthly -0.1487349 0.0825636 -1.8015 0.071729 .
## total_rain_monthly -0.0516428 0.0056438 -9.1504 < 2.2e-16 ***
## avg_wind_speed_monthly -4.9690367 0.3869532 -12.8414 < 2.2e-16 ***
## wind_direction_monthly -0.0750636 0.0243207 -3.0864 0.002044 **
## avg_pressure_monthly -0.0253518 0.0427984 -0.5924 0.553658
## month10 -7.9713979 0.9764232 -8.1639 4.689e-16 ***
## month11 -4.8835479 0.5269654 -9.2673 < 2.2e-16 ***
## month12 -2.1518432 0.2760787 -7.7943 8.805e-15 ***
## month2 -6.8022806 0.4025034 -16.8999 < 2.2e-16 ***
## month3 -11.9572639 0.6883657 -17.3705 < 2.2e-16 ***
## month4 -18.4144112 0.9132709 -20.1631 < 2.2e-16 ***
## month5 -21.2650737 1.2067274 -17.6221 < 2.2e-16 ***
## month6 -20.9619615 1.5166982 -13.8208 < 2.2e-16 ***
## month7 -19.6194279 1.8310187 -10.7150 < 2.2e-16 ***
## month8 -21.3224146 1.7949011 -11.8794 < 2.2e-16 ***
## month9 -13.8261309 1.3462614 -10.2700 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The treatment effect is still significant for Ring 1 stations excluding Plaza del Carmen (i.e. Plaza de España, Escuelas Aguirre and Parque del Retiro), thereby confirming the robustness of the spatially bounded spillover effect found above. The estimated treatment effect is now smaller in absolute value because the individual Ring 1 model above included Plaza del Carmen and thus captured some of the variability (i.e. the decrease) in pollution at the policy centroid.
We summarise all five models below, with the second column indicating the results of the last model (Ring 1 stations except Plaza del Carmen):
# Create the list of four models to include in the regression table
models_r <- list(m2_r1, m2_r1_pc, m2_r2, m2_r3, m2_r4)
# Generate regression table for 5 different models
stargazer(models_r,
se = lapply(models_r, se_clustered),
title = "Regression Results",
align = TRUE,
type = "text",
dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
dep.var.labels.include = FALSE,
column.labels = c("Ring 1", "Ring 1 w/o P.C.", "Ring 2", "Ring 3", "Ring 4"),
omit = exclude_vars,
covariate.labels = variable_labels)
##
## Regression Results
## ==========================================================================================
## Dependent Variable: Average Monthly NO2 Concentration
## -----------------------------------------------------------
## Ring 1 Ring 1 w/o P.C. Ring 2 Ring 3 Ring 4
## (1) (2) (3) (4) (5)
## ------------------------------------------------------------------------------------------
## Post-Treatment -5.947*** -6.117*** -6.290*** -6.672*** -6.605***
## (0.458) (0.470) (0.646) (0.579) (0.487)
##
## Post-Treatment*Ring 1 -4.830*** -4.126**
## (1.534) (1.918)
##
## Post-Treatment*Ring 2 -0.430
## (0.826)
##
## Post-Treatment*Ring 3 0.736
## (0.998)
##
## Post-Treatment*Ring 4 1.855
## (1.633)
##
## Average Monthly Temperature -0.144* -0.149* -0.163** -0.160* -0.163**
## (0.083) (0.083) (0.081) (0.082) (0.082)
##
## Total Monthly Rain -0.051*** -0.052*** -0.052*** -0.052*** -0.052***
## (0.006) (0.006) (0.006) (0.006) (0.006)
##
## Average Monthly Wind Speed -4.935*** -4.969*** -5.020*** -5.009*** -5.017***
## (0.386) (0.387) (0.397) (0.390) (0.392)
##
## Average Monthly Wind Direction -0.075*** -0.075*** -0.077*** -0.079*** -0.077***
## (0.025) (0.024) (0.023) (0.026) (0.024)
##
## Average Monthly Pressure -0.023 -0.025 -0.029 -0.029 -0.030
## (0.043) (0.043) (0.043) (0.043) (0.043)
##
## ------------------------------------------------------------------------------------------
## Observations 3,120 3,120 3,120 3,120 3,120
## R2 0.752 0.750 0.748 0.748 0.749
## Adjusted R2 0.747 0.745 0.743 0.744 0.744
## F Statistic (df = 18; 3062) 514.578*** 510.393*** 505.235*** 505.611*** 506.457***
## ==========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We can also re-run Model 5 (“complete”) on the four treatment rings, with the first ring excluding Plaza del Carmen, to assess to what extent the significance of our results in Model 5 depends on the presence of the policy centroid.
# Set up model for all rings, with Ring 1 excluding Plaza del Carmen
m5_pc <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
wind_direction_monthly + avg_pressure_monthly +
month,
model = "within", index = c("station_id"),
data = did_df_i)
# T-test with clustered standard errors
clustered_test_m5_pc <- coeftest(m5_pc, vcov = vcovHC(m5_pc, type = "HC1", cluster = "group"))
print(clustered_test_m5_pc)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## POST1 -6.184982 1.374174 -4.5009 7.020e-06 ***
## POST_treat_11 -4.058568 2.343204 -1.7321 0.083364 .
## POST_treat_21 -0.550210 1.493939 -0.3683 0.712679
## POST_treat_31 0.246594 1.610832 0.1531 0.878341
## POST_treat_41 1.437687 2.090946 0.6876 0.491771
## avg_temperature_monthly -0.159357 0.081034 -1.9666 0.049325 *
## total_rain_monthly -0.051681 0.005659 -9.1326 < 2.2e-16 ***
## avg_wind_speed_monthly -4.941223 0.387999 -12.7351 < 2.2e-16 ***
## wind_direction_monthly -0.080064 0.025421 -3.1496 0.001651 **
## avg_pressure_monthly -0.025496 0.043082 -0.5918 0.554032
## month10 -7.858674 0.965462 -8.1398 5.699e-16 ***
## month11 -4.836986 0.521565 -9.2740 < 2.2e-16 ***
## month12 -2.120219 0.276202 -7.6763 2.187e-14 ***
## month2 -6.792014 0.400510 -16.9584 < 2.2e-16 ***
## month3 -11.932414 0.685254 -17.4131 < 2.2e-16 ***
## month4 -18.355180 0.908584 -20.2020 < 2.2e-16 ***
## month5 -21.156826 1.199360 -17.6401 < 2.2e-16 ***
## month6 -20.801995 1.503309 -13.8375 < 2.2e-16 ***
## month7 -19.416795 1.807609 -10.7417 < 2.2e-16 ***
## month8 -21.134841 1.775233 -11.9054 < 2.2e-16 ***
## month9 -13.669269 1.331756 -10.2641 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Unsurprisingly, the coefficients do not vary much, but the treatment effect in Ring 1 (the three stations closest to Plaza del Carmen) is now significant only at the 0.1 level. Moreover, the positive coefficient for fourth-ring stations is larger when Plaza del Carmen is excluded. This means that, although the policy clearly reduces pollution in the vicinity of the LEZ, the decrease is much stronger inside the zone than immediately outside it.
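The ring estimates with and without Plaza del Carmen can be placed side by side. This is a sketch assuming the Section 5 model object is named m5:

```r
# Ring-specific DiD estimates with vs. without Plaza del Carmen in Ring 1
ring_terms <- paste0("POST_treat_", 1:4, "1")
cbind(with_PC = coef(m5)[ring_terms],       # 'm5' assumed from Section 5
      without_PC = coef(m5_pc)[ring_terms])
```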
Again, more than a robustness check, we want to make sure our unit (station) fixed effects are capturing the variability in pollution levels that arises from station-to-station differences. Perhaps the type of station contributes significantly to this variation, in which case failing to account for it would yield biased ATT estimates.
Recall that there are five types of stations in our model:
did_df %>% distinct(station_type) %>% pull()
## [1] "Urbana trafico" "Urbana fondo" "Suburbana (ciudad)"
## [4] "Suburbana fondo" "Urbana industrial"
We re-run Model 5 while controlling for the type of station:
# Set up the fixed effects model
m5_st <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
wind_direction_monthly + avg_pressure_monthly +
month + as.factor(station_type),
model = "within", index = c("station_id"),
data = did_df)
# Calculate clustered standard errors to export results using `stargazer()`
se_m5_st <- se_clustered(m5_st)
# Export the results
stargazer(m5_st,
se = list(se_m5_st),
title = "Regression Results",
align = TRUE,
type = "text",
dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
dep.var.labels.include = FALSE,
column.labels = c("Model 5 (Station Type)"),
omit = exclude_vars,
covariate.labels = variable_labels)
##
## Regression Results
## ====================================================================================
## Dependent Variable: Average Monthly NO2 Concentration
## -----------------------------------------------------
## Model 5 (Station Type)
## ------------------------------------------------------------------------------------
## Post-Treatment -5.166***
## (1.189)
##
## Post-Treatment*Ring 1 -5.621***
## (1.896)
##
## Post-Treatment*Ring 2 -1.582
## (1.296)
##
## Post-Treatment*Ring 3 -0.783
## (1.433)
##
## Post-Treatment*Ring 4 0.409
## (1.959)
##
## Average Monthly Temperature -0.158*
## (0.081)
##
## Total Monthly Rain -0.051***
## (0.006)
##
## Average Monthly Wind Speed -4.903***
## (0.385)
##
## Average Monthly Wind Direction -0.078***
## (0.026)
##
## Average Monthly Pressure -0.023
## (0.043)
##
## ------------------------------------------------------------------------------------
## Observations 3,120
## R2 0.752
## Adjusted R2 0.747
## F Statistic 442.305*** (df = 21; 3059)
## ====================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Fortunately for the sake of our chosen model, the estimated coefficients, standard errors and significance levels for all our variables are identical to those in the model without the station type control (Section 5). Note that, by indexing by station, we already account for all time-invariant station characteristics (station location and type, primarily). This renders the station type control redundant: since it has no within-station variation, the within estimator drops it, so it does not even appear in the output.
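As a minimal illustration of why a time-invariant control drops out, consider the within transformation directly (a sketch with made-up data, not our actual dataset):

```r
# Sketch: the within (fixed-effects) transformation demeans each variable by
# unit; a regressor that is constant within each station becomes all zeros,
# so the within estimator cannot identify its coefficient and drops it
df <- data.frame(
  station = rep(c("A", "B"), each = 3),
  type    = rep(c(1, 2), each = 3),      # time-invariant per station
  y       = c(10, 12, 11, 20, 22, 21)
)
demean <- function(x, g) x - ave(x, g)   # subtract the group (station) mean
type_within <- demean(df$type, df$station)
all(type_within == 0)                    # TRUE: nothing left to estimate
```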
Between April and November of 2018, the City Council undertook a comprehensive remodelling of Gran Via. Since this street is the main avenue crossing the LEZ, we want to know whether controlling for the works (and thus for one potential confounder identified in our DAG above) makes any difference to our estimates.
Once again, we load our data and create our treatment dummies as we did before.
# Merge pollution and weather data
did_df_gv <- model_pollution_df %>%
left_join(weather_full_final_a,
by = c("year", "month", "weather_station_name"))
# Create post-treatment dummy (POST = 1 if year is greater than 2018 or if equal to 2018 and the month is December)
did_df_gv <- did_df_gv %>%
mutate(POST = ifelse(year > 2018 | (year == 2018 & month == 12), 1, 0))
# Create treatment dummies for each ring
did_df_gv <- did_df_gv %>%
mutate(treatment_ring1 = ifelse(ring == "Ring 1: Treated", 1, 0),
treatment_ring2 = ifelse(ring == "Ring 2: Treated", 1, 0),
treatment_ring3 = ifelse(ring == "Ring 3: Treated", 1, 0),
treatment_ring4 = ifelse(ring == "Ring 4: Treated", 1, 0))
# Create interaction terms by multiplying POST and each corresponding treatment ring
did_df_gv <- did_df_gv %>%
mutate(POST_treat_1 = POST*as.numeric(treatment_ring1),
POST_treat_2 = POST*as.numeric(treatment_ring2),
POST_treat_3 = POST*as.numeric(treatment_ring3),
POST_treat_4 = POST*as.numeric(treatment_ring4))
# Transform new variables into factors
did_df_gv <- did_df_gv %>%
mutate_if(grepl("treatment_ring|POST_treat|POST", names(.)), ~as.factor(.))
# Create 'station_id' by concatenating 'town_code' and 'station_code', separated by a hyphen
did_df_gv$station_id <- as.factor(paste(did_df_gv$town_code, did_df_gv$station_code, sep="-"))
# Remove variables
did_df_gv <- did_df_gv %>% select(-c(min_temperature_monthly, max_temperature_monthly, town_code, station_code))
Then, after transforming our “month” variable to integer values to allow for more reliable computations, we create a dummy for the remodelling of Gran Via which takes a value of 1 for observations between April and November of 2018 (both inclusive), and 0 otherwise.
# Convert month variable to integers and create dummy for Gran Via public works
did_df_gv <- did_df_gv %>%
mutate(month = as.integer(month)) %>%
mutate(GRAN_VIA = as.factor(ifelse(year == 2018 & month >= 4 & month <= 11, 1, 0)))
# Convert month variables to strings before modelling
did_df_gv$month <- as.character(did_df_gv$month)
We run our model:
# Set up the fixed effects model with Gran Via control
m5_gv <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
wind_direction_monthly + avg_pressure_monthly +
month + as.factor(GRAN_VIA),
model = "within", index = c("station_id"),
data = did_df_gv)
# Specify variable names
variable_labels_gv <- c(
"POST1" = "Post-Treatment",
"POST_treat_11" = "Post-Treatment*Ring 1",
"POST_treat_21" = "Post-Treatment*Ring 2",
"POST_treat_31" = "Post-Treatment*Ring 3",
"POST_treat_41" = "Post-Treatment*Ring 4",
"avg_temperature_monthly" = "Average Monthly Temperature",
"total_rain_monthly" = "Total Monthly Rain",
"avg_wind_speed_monthly" = "Average Monthly Wind Speed",
"wind_direction_monthly" = "Average Monthly Wind Direction",
"avg_pressure_monthly" = "Average Monthly Pressure",
"as.factor(GRAN_VIA)1" = "Gran Via")
# Calculate clustered standard errors to export results using `stargazer()`
se_m5_gv <- se_clustered(m5_gv)
# Export results
stargazer(m5_gv,
se = list(se_m5_gv),
title = "Regression Results",
align = TRUE,
type = "text",
dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
dep.var.labels.include = FALSE,
column.labels = c("Model 5 (Gran Via Public Works)"),
omit = exclude_vars,
covariate.labels = variable_labels_gv)
##
## Regression Results
## ====================================================================================
## Dependent Variable: Average Monthly NO2 Concentration
## -----------------------------------------------------
## Model 5 (Gran Via Public Works)
## ------------------------------------------------------------------------------------
## Post-Treatment -5.701***
## (1.226)
##
## Post-Treatment*Ring 1 -5.573***
## (1.906)
##
## Post-Treatment*Ring 2 -1.600
## (1.308)
##
## Post-Treatment*Ring 3 -0.759
## (1.445)
##
## Post-Treatment*Ring 4 0.434
## (1.968)
##
## Average Monthly Temperature -0.197**
## (0.080)
##
## Total Monthly Rain -0.050***
## (0.005)
##
## Average Monthly Wind Speed -5.049***
## (0.376)
##
## Average Monthly Wind Direction -0.093***
## (0.026)
##
## Average Monthly Pressure -0.031
## (0.043)
##
## Gran Via -3.592***
## (0.412)
##
## ------------------------------------------------------------------------------------
## Observations 3,120
## R2 0.759
## Adjusted R2 0.755
## F Statistic 438.723*** (df = 22; 3058)
## ====================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The estimates change slightly compared to the original model, but not significantly. However, the variable ‘GRAN_VIA’ (‘Gran Via’ in the regression table) is significant: the presence of construction works in Gran Via during the eight months prior to the introduction of the LEZ is associated with an average pollution level around 3.6 µg/m3 lower (compared to its absence), holding all other variables in the model constant. In other words, there is a significant overall decrease in average pollution in Madrid associated with this particular instance of public works.
Yet, this does not tell us anything about the pollution level in the first ring (which includes the Gran Via street itself), since we cannot interact this dummy with ‘POST_treat_1’: there is no time overlap between the implementation of the LEZ and the construction works.
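The lack of overlap can be verified directly; this small sketch rebuilds the two dummies on a calendar grid:

```r
# Sketch: GRAN_VIA (Apr-Nov 2018) and POST (from Dec 2018) never coincide,
# so an interaction of GRAN_VIA with POST_treat_1 would be identically zero
cal <- expand.grid(month = 1:12, year = 2015:2021)
cal$POST     <- as.integer(cal$year > 2018 | (cal$year == 2018 & cal$month == 12))
cal$GRAN_VIA <- as.integer(cal$year == 2018 & cal$month >= 4 & cal$month <= 11)
sum(cal$POST * cal$GRAN_VIA)  # 0: the interaction has no variation to exploit
```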
Hence, we refrain from including this variable in our main models since the estimates are basically the same and the combination of unit fixed effects (for Plaza del Carmen in this case) with month fixed effects should take care of (a lot of) this variability anyway.
Along the lines of the previous “robustness check”, we want to know whether the traffic restrictions implemented to contain the spread of the COVID-19 pandemic in 2020 had any impact on the estimates of the policy effects found above. As a quick reminder, the Spanish national government declared the State of Alarm (or State of Emergency) on March 14th 2020 and lifted it on June 21st of that same year. This measure entailed, primarily, a full lockdown with very few exceptions (e.g. grocery shopping and, at first, work in essential sectors). Hence, it constituted an unprecedented disruption of normal traffic flows and, presumably, of pollution levels.
Theoretically, month and station fixed effects should capture all variability triggered by this event as all locations were probably affected very similarly by the restrictions. Still, we perform the same steps as with the Gran Via control and run the model at the end of this chunk:
# Merge pollution and weather data
did_df_covid <- model_pollution_df %>%
left_join(weather_full_final_a,
by = c("year", "month", "weather_station_name"))
# Create post-treatment dummy (POST = 1 if year is greater than 2018 or if equal to 2018 and the month is December)
did_df_covid <- did_df_covid %>%
mutate(POST = ifelse(year > 2018 | (year == 2018 & month == 12), 1, 0))
# Create treatment dummies for each ring
did_df_covid <- did_df_covid %>%
mutate(treatment_ring1 = ifelse(ring == "Ring 1: Treated", 1, 0),
treatment_ring2 = ifelse(ring == "Ring 2: Treated", 1, 0),
treatment_ring3 = ifelse(ring == "Ring 3: Treated", 1, 0),
treatment_ring4 = ifelse(ring == "Ring 4: Treated", 1, 0))
# Create interaction terms by multiplying POST and each corresponding treatment ring
did_df_covid <- did_df_covid %>%
mutate(POST_treat_1 = POST*as.numeric(treatment_ring1),
POST_treat_2 = POST*as.numeric(treatment_ring2),
POST_treat_3 = POST*as.numeric(treatment_ring3),
POST_treat_4 = POST*as.numeric(treatment_ring4))
# Transform new variables into factors
did_df_covid <- did_df_covid %>%
mutate_if(grepl("treatment_ring|POST_treat|POST", names(.)), ~as.factor(.))
# Create 'station_id' by concatenating 'town_code' and 'station_code', separated by a hyphen
did_df_covid$station_id <- as.factor(paste(did_df_covid$town_code, did_df_covid$station_code, sep="-"))
# Remove variables
did_df_covid <- did_df_covid %>% select(-c(min_temperature_monthly, max_temperature_monthly, town_code, station_code))
# Convert month variable to integers and create dummy for 'State of Alarm'
did_df_covid <- did_df_covid %>%
mutate(month = as.integer(month)) %>%
mutate(COVID = as.factor(ifelse(year == 2020 & month >= 3 & month <= 6, 1, 0)))
# Convert month variables to strings before modelling
did_df_covid$month <- as.character(did_df_covid$month)
# Set up the fixed effects model with COVID-19 control
m5_covid <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
wind_direction_monthly + avg_pressure_monthly +
month + as.factor(COVID),
model = "within", index = c("station_id"),
data = did_df_covid)
# Specify variable names
variable_labels_covid <- c(
"POST1" = "Post-Treatment",
"POST_treat_11" = "Post-Treatment*Ring 1",
"POST_treat_21" = "Post-Treatment*Ring 2",
"POST_treat_31" = "Post-Treatment*Ring 3",
"POST_treat_41" = "Post-Treatment*Ring 4",
"avg_temperature_monthly" = "Average Monthly Temperature",
"total_rain_monthly" = "Total Monthly Rain",
"avg_wind_speed_monthly" = "Average Monthly Wind Speed",
"wind_direction_monthly" = "Average Monthly Wind Direction",
"avg_pressure_monthly" = "Average Monthly Pressure",
"as.factor(COVID)1" = "COVID-19")
# Calculate clustered standard errors to export results using `stargazer()`
se_m5_covid <- se_clustered(m5_covid)
# Export results
stargazer(m5_covid,
se = list(se_m5_covid),
title = "Regression Results",
align = TRUE,
type = "text",
dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
dep.var.labels.include = FALSE,
column.labels = c("Model 5 (COVID-19)"),
omit = exclude_vars,
covariate.labels = variable_labels_covid)
##
## Regression Results
## ====================================================================================
## Dependent Variable: Average Monthly NO2 Concentration
## -----------------------------------------------------
## Model 5 (COVID-19)
## ------------------------------------------------------------------------------------
## Post-Treatment -4.036***
## (1.214)
##
## Post-Treatment*Ring 1 -5.607***
## (1.909)
##
## Post-Treatment*Ring 2 -1.474
## (1.312)
##
## Post-Treatment*Ring 3 -0.838
## (1.459)
##
## Post-Treatment*Ring 4 0.332
## (1.998)
##
## Average Monthly Temperature -0.011
## (0.081)
##
## Total Monthly Rain -0.041***
## (0.005)
##
## Average Monthly Wind Speed -5.267***
## (0.421)
##
## Average Monthly Wind Direction -0.056**
## (0.027)
##
## Average Monthly Pressure -0.024
## (0.043)
##
## COVID-19 -8.441***
## (0.648)
##
## ------------------------------------------------------------------------------------
## Observations 3,120
## R2 0.771
## Adjusted R2 0.767
## F Statistic 469.271*** (df = 22; 3058)
## ====================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As was the case with the model in the immediately preceding section, the coefficient of the dummy is statistically significant, but the treatment effect estimates have not changed much with its inclusion. This indicates that, as we predicted, the station and month fixed effects may actually be enough to account for pollution variation caused by an external event occurring within a set window of time and affecting all stations more or less equally.
The next check entails including year, rather than month, fixed effects in our model. In theory, month fixed effects should perform the same function as year fixed effects (capturing time-related variability in pollution levels that affects all stations equally), but at a finer resolution, since they also absorb seasonal patterns.
Hence we simply run Model 5 again but control for the year rather than the month:
# Convert year variables to strings
did_df$year <- as.character(did_df$year)
# Set up the fixed effects model
m5_y <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
wind_direction_monthly + avg_pressure_monthly +
year,
model = "within", index = c("station_id"),
data = did_df)
# Specify variable names
variable_labels_y <- c(
"POST1" = "Post-Treatment",
"POST_treat_11" = "Post-Treatment*Ring 1",
"POST_treat_21" = "Post-Treatment*Ring 2",
"POST_treat_31" = "Post-Treatment*Ring 3",
"POST_treat_41" = "Post-Treatment*Ring 4",
"avg_temperature_monthly" = "Average Monthly Temperature",
"total_rain_monthly" = "Total Monthly Rain",
"avg_wind_speed_monthly" = "Average Monthly Wind Speed",
"wind_direction_monthly" = "Average Monthly Wind Direction",
"avg_pressure_monthly" = "Average Monthly Pressure",
"year2016" = "2016",
"year2017" = "2017",
"year2018" = "2018",
"year2019" = "2019",
"year2020" = "2020",
"year2021" = "2021")
# Calculate clustered standard errors to export results using `stargazer()`
se_m5_y <- se_clustered(m5_y)
# Export results
stargazer(m5_y,
se = list(se_m5_y),
title = "Regression Results",
align = TRUE,
type = "text",
dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
dep.var.labels.include = FALSE,
column.labels = c("Model 5 (Year Fixed Effects)"),
covariate.labels = variable_labels_y)
##
## Regression Results
## ====================================================================================
## Dependent Variable: Average Monthly NO2 Concentration
## -----------------------------------------------------
## Model 5 (Year Fixed Effects)
## ------------------------------------------------------------------------------------
## Post-Treatment -0.449
## (1.582)
##
## Post-Treatment*Ring 1 -5.092**
## (2.026)
##
## Post-Treatment*Ring 2 -1.602
## (1.478)
##
## Post-Treatment*Ring 3 -1.110
## (1.589)
##
## Post-Treatment*Ring 4 0.268
## (2.203)
##
## Average Monthly Temperature -1.036***
## (0.031)
##
## Total Monthly Rain -0.059***
## (0.007)
##
## Average Monthly Wind Speed -7.345***
## (0.464)
##
## Average Monthly Wind Direction 0.013
## (0.024)
##
## Average Monthly Pressure 0.019
## (0.051)
##
## 2016 -1.096***
## (0.420)
##
## 2017 0.280
## (0.425)
##
## 2018 -3.428***
## (0.560)
##
## 2019 -1.688
## (1.229)
##
## 2020 -8.832***
## (1.043)
##
## 2021 -9.645***
## (1.220)
##
## ------------------------------------------------------------------------------------
## Observations 3,120
## R2 0.684
## Adjusted R2 0.678
## F Statistic 413.750*** (df = 16; 3064)
## ====================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We see that the coefficient estimates for the interaction terms are quite similar to those of the model with month fixed effects. Interestingly, the ‘POST’ variable loses significance and its coefficient shrinks from around -5 to -0.5. In other words, unlike the month fixed effects model, this specification does not find a significant change in the average pollution level of control-group stations in the post-treatment period.
There could be several explanations for this. The most plausible one is that only month fixed effects capture seasonal variation in pollution levels: the significant and larger estimate in the model that includes them suggests there is substantial within-year variation in pollution that the year fixed effects model misses.
Another important difference with respect to the month fixed effects model is that the adjusted R-squared has dropped from around 0.75 to 0.68. Hence, month fixed effects do seem to account for more variability in the outcome than year fixed effects.
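The intuition that month dummies absorb seasonal variation which year dummies miss can be illustrated on simulated data (a sketch with a made-up seasonal series, not our pollution data):

```r
# Sketch: a purely seasonal outcome is explained well by month dummies
# but essentially not at all by year dummies
set.seed(1)
d <- expand.grid(month = 1:12, year = 2015:2021)
d$y <- 30 + 8 * cos(2 * pi * d$month / 12) + rnorm(nrow(d), sd = 1)
r2 <- function(f) summary(lm(f, data = d))$r.squared
r2(y ~ factor(month))  # close to 1: seasonality captured
r2(y ~ factor(year))   # close to 0: seasonality missed
```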
Finally, to address potential non-linearities in the effect of the weather covariates on the pollution level, we include their squared terms in our regression for Model 5:
# Set up the fixed effects model with quadratic weather controls
m5_quad <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
avg_temperature_monthly + I(avg_temperature_monthly^2) +
total_rain_monthly + I(total_rain_monthly^2) +
avg_wind_speed_monthly + I(avg_wind_speed_monthly^2) +
wind_direction_monthly + I(wind_direction_monthly^2) +
avg_pressure_monthly + I(avg_pressure_monthly^2) +
month,
model = "within", index = c("station_id"),
data = did_df)
# Rename variables
variable_labels_q <- c(
"POST1" = "Post-Treatment",
"POST_treat_11" = "Post-Treatment*Ring 1",
"POST_treat_21" = "Post-Treatment*Ring 2",
"POST_treat_31" = "Post-Treatment*Ring 3",
"POST_treat_41" = "Post-Treatment*Ring 4",
"avg_temperature_monthly" = "Average Monthly Temperature",
"I(avg_temperature_monthly2)" = "Squared Average Monthly Temperature",
"total_rain_monthly" = "Total Monthly Rain",
"I(total_rain_monthly2)" = "Squared Total Monthly Rain",
"avg_wind_speed_monthly" = "Average Monthly Wind Speed",
"I(avg_wind_speed_monthly2)" = "Squared Average Monthly Wind Speed",
"wind_direction_monthly" = "Average Monthly Wind Direction",
"I(wind_direction_monthly2)" = "Squared Average Monthly Wind Direction",
"avg_pressure_monthly" = "Average Monthly Pressure",
"I(avg_pressure_monthly2)" = "Squared Average Monthly Pressure")
# Calculate clustered standard errors to export results using `stargazer()`
se_m5_quad <- se_clustered(m5_quad)
# Export results
stargazer(m5_quad,
se = list(se_m5_quad),
title = "Regression Results",
align = TRUE,
type = "text",
dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
dep.var.labels.include = FALSE,
column.labels = c("Model 5 (Quadratic Specification)"),
omit = exclude_vars,
covariate.labels = variable_labels_q)
##
## Regression Results
## ============================================================================================
## Dependent Variable: Average Monthly NO2 Concentration
## -----------------------------------------------------
## Model 5 (Quadratic Specification)
## --------------------------------------------------------------------------------------------
## Post-Treatment -5.254***
## (1.193)
##
## Post-Treatment*Ring 1 -4.942***
## (1.894)
##
## Post-Treatment*Ring 2 -1.089
## (1.315)
##
## Post-Treatment*Ring 3 -0.497
## (1.435)
##
## Post-Treatment*Ring 4 0.633
## (1.933)
##
## Average Monthly Temperature 0.248
## (0.306)
##
## Squared Average Monthly Temperature -0.013*
## (0.008)
##
## Total Monthly Rain -0.079***
## (0.006)
##
## Squared Total Monthly Rain 0.0002***
## (0.00002)
##
## Average Monthly Wind Speed -9.985***
## (1.155)
##
## Squared Average Monthly Wind Speed 0.876***
## (0.190)
##
## Average Monthly Wind Direction 0.009
## (0.090)
##
## Squared Average Monthly Wind Direction -0.001
## (0.001)
##
## Average Monthly Pressure -4.401
## (3.497)
##
## Squared Average Monthly Pressure 0.002
## (0.002)
##
## --------------------------------------------------------------------------------------------
## Observations 3,120
## R2 0.763
## Adjusted R2 0.758
## F Statistic 377.668*** (df = 26; 3054)
## ============================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
There are three main insights gathered from the quadratic specification.
First, the coefficients for the meteorological covariates in their linear form diverge from those of the quadratic terms. For instance, the square of total rain is significant and positive (coefficient = 0.0002) while total rain in its linear form has a negative (and also significant) effect on pollution (coefficient = -0.079). Hence, higher rainfall is associated with lower pollution levels, but the marginal effect weakens as the amount of rain increases. Concerning significance levels, none of the variables that were insignificant in the linear model gain significance in the quadratic one, implying non-linear effects are not that strong.
Second, the estimate of the policy effect for the first treatment ring is close to, but slightly lower in absolute value than, the one in the linear model. This change could be explained by potential non-linearities now captured by the squared terms, or by multicollinearity arising from the addition of variables containing much of the same information as those already in the model.
Third, the adjusted R-squared has barely increased compared to the original model, indicating that the addition of squared terms does not significantly improve the model fit.
Hence, our results are robust to the quadratic specification of the controls: although a few quadratic terms are significant, our treatment effect estimates are very close to the ones in the linear model.
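For reference, the I() wrapper is what makes the squared terms literal powers inside the formula; a self-contained sketch (simulated data, not ours) of adding a quadratic term and testing whether it improves the fit:

```r
# Sketch: fit a quadratic term with I() and F-test it against the linear model
set.seed(42)
x <- runif(200, 0, 10)
y <- 2 - 0.5 * x + 0.1 * x^2 + rnorm(200, sd = 0.3)
fit_lin  <- lm(y ~ x)
fit_quad <- lm(y ~ x + I(x^2))
anova(fit_lin, fit_quad)  # F-test: does the squared term improve the fit?
```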
Let’s try to falsify our results. We start by testing the significance of our treatment effect (within the 2.5-kilometre range from Plaza del Carmen, since we found no spillover effects beyond it) in years prior to the implementation of Madrid Central. This will allow us to assess whether the estimated treatment effect is driven by the treatment itself or rather by other factors present in the data. In other words, for our treatment effect to be robust, we would like to find no statistical significance in the coefficient of our ‘POST_treat_1’ variable when we pretend the treatment took place on the arbitrarily chosen date of November 2015.
We reproduce the steps used above, but this time with data from 2012 to June of 2018, so as to avoid any anticipatory effects in the run-up to November 2018 (the date of actual implementation of the policy). We also use only the Model 2 specification from Section 5 since (again) it is only for the first treatment ring that we found a statistically significant policy effect.
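The logic of the placebo test can be sketched on simulated data with no true policy effect (hypothetical numbers; ‘POST_fake’ mimics the November 2015 cutoff):

```r
# Sketch: with no real policy, a fake post-period dummy interacted with
# treatment status should yield an estimate indistinguishable from zero
set.seed(7)
d <- expand.grid(month = 1:12, year = 2012:2017, station = 1:10)
d$treated   <- as.integer(d$station <= 3)          # level difference only
d$y         <- 40 - 5 * d$treated + rnorm(nrow(d), sd = 4)
d$POST_fake <- as.integer(d$year > 2015 | (d$year == 2015 & d$month >= 11))
fit <- lm(y ~ POST_fake * treated, data = d)
coef(summary(fit))["POST_fake:treated", ]          # placebo DiD estimate
```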
Pollution data for pollution monitoring stations inside the City of Madrid (2012 - 2018)
We load pollution hourly data for the city of Madrid for the years 2012 to 2018.
# `For` loop for hourly data for all months in all years
# Build the vector of monthly file names for 2012-2018 (e.g. "ene_mo12.csv")
months_es <- c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic")
file_names_r <- paste0(rep(months_es, times = 7), "_mo", rep(12:18, each = 12), ".csv")
# Create an empty dataframe to store the cleaned data
all_city_data_clean_r <- data.frame()
# Loop through the file names and apply the same transformations to each dataset
for (file_name in file_names_r) {
# Read the data
city_data_r <- read_delim(file_name, delim=";")
# Apply transformations
city_data_clean_r <- city_data_r %>%
select(-c(PROVINCIA, PUNTO_MUESTREO)) %>%
rename(town_code = MUNICIPIO,
station_code = ESTACION,
magnitude = MAGNITUD,
year = ANO,
month = MES,
day = DIA) %>%
select(-matches("^V")) %>%
pivot_longer(cols = starts_with("H"), names_to = "hour", values_to = "pollution_level") %>%
# Keep only NO2 measures and remove 'magnitude' column
filter(magnitude == "8") %>%
select(-magnitude) %>%
# Adjust variable types
mutate(month = as.numeric(month),
day = as.numeric(day),
hour = as.numeric(str_replace(hour, "H", "")),
# Remove leading zeros from 'code' variables
town_code = as.factor(str_remove(town_code, "^0+")),
station_code = as.factor(str_remove(station_code, "^0+")),
# Replace comma with point for 'pollution_level'
pollution_level = as.numeric(gsub(",", ".", pollution_level)))
# Append the cleaned data to the 'all_city_data_clean' dataframe
all_city_data_clean_r <- bind_rows(all_city_data_clean_r, city_data_clean_r)
}
# Aggregate the data to monthly level
city_monthly_r <- all_city_data_clean_r %>%
group_by(town_code, station_code, year, month) %>%
summarise(pollution_level = mean(pollution_level))
Pollution data for pollution monitoring stations outside the City of Madrid (2012 - 2018)
We do the same for pollution stations outside the city of Madrid.
# Create file names vector
file_names_r <- c(paste0("com_", c(2012:2018), ".csv"))
# Create an empty dataframe to store the cleaned data
all_com_data_clean_r <- data.frame()
# Loop through the file names and apply the same transformations to each dataset
for (file_name in file_names_r) {
# Read the data
com_data_r <- read_delim(file_name, delim=";")
# Apply transformations
com_data_clean_r <- com_data_r %>%
dplyr::select(-c(provincia, punto_muestreo)) %>%
rename(town_code = municipio,
station_code = estacion,
magnitude = magnitud,
year = ano,
month = mes,
day = dia) %>%
select(-matches("^V")) %>%
pivot_longer(cols = starts_with("h"), names_to = "hour", values_to = "pollution_level") %>%
# Keep only NO2 measures and remove 'magnitude' column
filter(magnitude == "8") %>%
select(-magnitude) %>%
# Adjust variable types, remove leading zeros, replace points with commas and adjust contents for comparability
mutate(month = as.numeric(month),
day = as.numeric(day),
hour = as.numeric(str_replace(hour, "h", "")),
town_code = as.factor(str_remove(town_code, "^0+")),
station_code = as.factor(str_remove(station_code, "^0+")),
pollution_level = as.numeric(gsub(",", ".", pollution_level)))
# Append the cleaned data to the 'all_com_data_clean' dataframe and drop NA
all_com_data_clean_r <- bind_rows(all_com_data_clean_r, com_data_clean_r) %>%
drop_na(pollution_level)
}
# Aggregate the data to monthly level
com_monthly_r <- all_com_data_clean_r %>%
group_by(town_code, station_code, year, month) %>%
summarise(pollution_level = mean(pollution_level))
Complete dataset of monthly pollution observations for the air quality network of the Community of Madrid
We merge both sets of data by row. We also join the in-City pollution data with the in-City station data and the out-of-City pollution data with the out-of-City station data, and then combine these two datasets into one.
# Bind potential treatment with control stations
full_data_r <- bind_rows(city_monthly_r, com_monthly_r)
# Join 'city_monthly' with 'city_stations' by 'town_code' and 'station_code'
city_data_r <- city_monthly_r %>%
left_join(city_stations_clean, by = c("town_code", "station_code"))
# Join 'com_monthly' with 'com_stations' by 'town_code' and 'station_code'
com_data_r <- com_monthly_r %>%
left_join(com_stations_clean, by = c("town_code", "station_code"))
# Combine both dataframes into a single one for the whole Community
full_data_complete_r <- bind_rows(city_data_r, com_data_r) %>%
select(-c(station_name, rural_area_type))
# Replace the missing values in 'town_name' by Madrid
full_data_complete_r$town_name <- ifelse(is.na(full_data_complete_r$town_name), "Madrid", full_data_complete_r$town_name)
Dataset with pollution data and station information necessary for modelling
We take the last dataset and join it with our previously created dataframe of new variables for each station. Just as we did for the estimation of our actual treatment effect (there, keeping observations up to June 2021), we only keep observations up to June 2018. We also remove Ring 6 stations.
# Merge data with monthly pollution averages and station data (including rings and nearby weather stations)
model_pollution_df_r <- full_data_complete_r %>%
left_join(data_all_stations, by = c("town_code", "station_code", "address", "station_type",
"longitude", "latitude", "distance_km")) %>%
ungroup()
# Remove data from July 2018 onwards
model_pollution_df_r <- model_pollution_df_r %>%
filter(year < 2018 | (year == 2018 & month < 7)) %>%
# Remove stations in "Ring 6: Excluded"
filter(ring != "Ring 6: Excluded")
# Remove any levels not present in the dataframe
model_pollution_df_r$ring <- droplevels(model_pollution_df_r$ring)
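droplevels() is needed because subsetting a dataframe keeps the factor’s original level set; a minimal illustration:

```r
# Sketch: filtering rows does not remove unused factor levels; droplevels() does
f <- factor(c("Ring 1", "Ring 6"), levels = c("Ring 1", "Ring 6"))
f_kept <- f[f != "Ring 6"]
nlevels(f_kept)             # 2: "Ring 6" is still a level
nlevels(droplevels(f_kept)) # 1
```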
Weather data (2012 - 2018)
We load weather data for the new time window and make the appropriate transformations. We impute missing values and group by month just like we did before with the actual treatment period.
set.seed(1234)
# Obtain daily values for all stations in Madrid
weather_full_r <- aemet_daily_clim(station = "all", start = "2012-01-01", end = "2018-06-30",
verbose = FALSE, return_sf = FALSE) %>%
filter(provincia == "MADRID")
# Make transformations: select, adjust variable types and create new columns
weather_full_final_r <- weather_full_r %>%
select(date = fecha,
weather_station_name = nombre,
avg_temperature = tmed,
min_temperature = tmin,
max_temperature = tmax,
total_rain = prec,
direction_max_wind_speed = dir,
avg_wind_speed = velmedia,
min_pressure = presMin,
max_pressure = presMax) %>%
# Replace comma with point
mutate(total_rain = as.numeric(gsub(",", ".", total_rain)),
direction_max_wind_speed = as.numeric(direction_max_wind_speed),
# Obtain 'year' and 'month' columns from 'date' column
date = ymd(date),
year = year(date),
month = month(date)) %>%
select(-date)
# Replace all values of 88 (AEMET's code for variable wind direction) with NAs
weather_full_final_r$direction_max_wind_speed <- ifelse(weather_full_final_r$direction_max_wind_speed == 88, NA,
weather_full_final_r$direction_max_wind_speed)
# Specify columns to impute
cols_to_impute <- c("avg_temperature", "max_temperature", "min_temperature",
"total_rain", "direction_max_wind_speed", "avg_wind_speed",
"min_pressure", "max_pressure")
# Impute missing values in the selected columns using PMM (predictive mean matching)
imputed_data <- mice(weather_full_final_r[cols_to_impute], m = 10, method = "pmm")
# Update original dataframe with imputed values
weather_full_final_r[cols_to_impute] <- complete(imputed_data)[, cols_to_impute]
# Group imputed weather variables and compute monthly averages
weather_full_final_r <- weather_full_final_r %>%
group_by(weather_station_name, year, month) %>%
summarise(avg_temperature_monthly = round(mean(avg_temperature), 2),
min_temperature_monthly = round(min(min_temperature), 2),
max_temperature_monthly = round(max(max_temperature), 2),
total_rain_monthly = round(sum(total_rain), 2),
wind_direction_monthly = round(mean(direction_max_wind_speed), 2),
avg_wind_speed_monthly = round(mean(avg_wind_speed), 2),
avg_pressure_monthly = round(mean((min_pressure + max_pressure)/2), 2))
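The grouping step above can be illustrated with a minimal base-R equivalent (toy data and a hypothetical station name, not part of the pipeline):

```r
# Toy illustration of monthly aggregation: average daily temperatures
# by station, year and month using base R's aggregate()
toy <- data.frame(weather_station_name = "STATION_X",  # hypothetical station
                  year = 2012,
                  month = c(1, 1, 2),
                  avg_temperature = c(4, 6, 8))
agg <- aggregate(avg_temperature ~ weather_station_name + year + month,
                 data = toy, FUN = mean)
agg
# January averages to 5, February to 8
```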
Pre-modelling feature engineering
We create our dataset ‘did_df_r’. This includes pollution and weather data which we will use to run our model. We create the dummies that go into our regressions - time, treatment ring (again, only the first) and the interaction between the two - and transform them into factors. We make some more modifications just like we did above.
# Rename weather station variable so that it is the same in both datasets
model_pollution_df_r <- model_pollution_df_r %>%
rename(weather_station_name = closest_weather_station)
# Merge pollution and weather data
did_df_r <- model_pollution_df_r %>%
left_join(weather_full_final_r,
by = c("year", "month", "weather_station_name"))
# Create post-treatment, treatment and interaction terms
did_df_r <- did_df_r %>% mutate(
# Create post-treatment dummy (placebo implementation at the end of November 2015,
# so the post-treatment period starts in December 2015)
POST = ifelse(year > 2015 | (year == 2015 & month == 12), 1, 0),
# Create treatment dummy for ring 1
treatment_ring1 = ifelse(ring == "Ring 1: Treated", 1, 0),
# Create interaction terms by multiplying POST and treatment ring
POST_treat_1 = POST*as.numeric(treatment_ring1))
# Transform new variables into factors
did_df_r <- did_df_r %>%
mutate(across(any_of(c("POST", "treatment_ring1", "POST_treat_1")), as.factor))
# Create 'station_id' by concatenating 'town_code' and 'station_code', separated by a hyphen
did_df_r$station_id <- as.factor(paste(did_df_r$town_code, did_df_r$station_code, sep="-"))
# Remove variables
did_df_r <- did_df_r %>% select(-c(min_temperature_monthly, max_temperature_monthly, town_code, station_code))
# Convert month variables to strings
did_df_r$month <- as.character(did_df_r$month)
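As a quick sanity check (toy rows, not part of the dataset), the placebo POST dummy should switch on in December 2015 and remain on thereafter:

```r
# Verify the placebo post-treatment cutoff on a few toy year/month pairs
toy <- data.frame(year  = c(2015, 2015, 2016, 2014),
                  month = c(11,   12,   1,    6))
toy$POST <- ifelse(toy$year > 2015 | (toy$year == 2015 & toy$month == 12), 1, 0)
toy$POST
# November 2015 is pre-treatment; December 2015 onwards is post-treatment
```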
Modelling
We set up the model with the first treatment ring, the weather controls and the month and station fixed effects. We compute clustered standard errors.
# Set up the fixed effects model
m <- plm(pollution_level ~ POST + POST_treat_1 +
avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
wind_direction_monthly + avg_pressure_monthly +
month,
model = "within", index = "station_id",
data = did_df_r)
# Rename variables
variable_labels_r <- c(
"POST1" = "Post-Treatment",
"POST_treat_11" = "Post-Treatment*Ring 1",
"avg_temperature_monthly" = "Average Monthly Temperature",
"total_rain_monthly" = "Total Monthly Rain",
"avg_wind_speed_monthly" = "Average Monthly Wind Speed",
"wind_direction_monthly" = "Average Monthly Wind Direction",
"avg_pressure_monthly" = "Average Monthly Pressure")
# Calculate clustered standard errors to export results using `stargazer()`
se_m <- se_clustered(m)
# Export results
stargazer(m,
se = list(se_m),
title = "Regression Results",
align = TRUE,
type = "text",
dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
dep.var.labels.include = FALSE,
column.labels = c("Model 2 (Placebo Treatment Period)"),
omit = exclude_vars,
covariate.labels = variable_labels_r)
##
## Regression Results
## ====================================================================================
## Dependent Variable: Average Monthly NO2 Concentration
## -----------------------------------------------------
## Model 2 (Placebo Treatment Period)
## ------------------------------------------------------------------------------------
## Post-Treatment 2.141***
## (0.380)
##
## Post-Treatment*Ring 1 1.768
## (1.330)
##
## Average Monthly Temperature 0.118
## (0.077)
##
## Total Monthly Rain -0.050***
## (0.008)
##
## Average Monthly Wind Speed -4.207***
## (0.468)
##
## Average Monthly Wind Direction -0.078***
## (0.029)
##
## Average Monthly Pressure -0.013
## (0.039)
##
## ------------------------------------------------------------------------------------
## Observations 3,100
## R2 0.746
## Adjusted R2 0.741
## F Statistic 495.926*** (df = 18; 3042)
## ====================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Clearly, the interaction term lacks statistical significance, meaning the “fake treatment” we arbitrarily placed in November 2015 had no significant effect on pollution levels in the four stations closest to Plaza del Carmen. This contrasts with the actual treatment in November 2018, which did have a significant effect on pollution for the same four stations.
This suggests that the estimated treatment effect during the actual treatment period is more likely to be driven by the treatment itself rather than by other factors, which provides some evidence for the robustness of the results.
The second placebo test consists of reproducing our analysis above (same criteria for assigning stations into treatment and same period of intervention) but using an outcome that, in theory, should not be affected (or affected but to a much lower degree) by the treatment. Due to the availability of open data, we select noise pollution.
It is true that a potential association between the introduction of an LEZ and average noise levels cannot be completely ruled out. Going back to our DAG, Madrid Central could affect noise levels through the traffic redirection mechanism (i.e. fewer cars in the immediate vicinity of Madrid Central, lower average noise levels). However, this effect is likely to be considerably smaller than the one on pollution levels due to (for instance) the neutralization of the fleet recomposition channel in this new identification strategy.
Noise pollution data inside the City of Madrid (2015 - 2021)
Hence, we download the data from the City Council’s Data Portal. Monthly data is available since 1998, but only for the City of Madrid, which is enough for our purposes as we only found significant effects on pollution in the immediate vicinity of the policy centroid.
We want to use data from 2015 to 2021 as we did for the main regression on pollution levels to ensure comparability. Yet, data is only available in CSV format since 2017 so we create a loop to read data for the last five years in our sample. We select the following variables: date, station code, station name and the noise level. Among all available indicators, we select “LAeq 24”, which captures the average long-term noise level in the area during all 24-hour periods of each month.
We perform some data cleaning and remove missing values due to the fact that they are few and evenly distributed through time and space.
# Create file names vector
file_names <- paste0("Anio", 2017:2021, "_acus.csv")
# Create an empty dataframe to store the cleaned data
all_acus_data_clean <- data.frame()
# Loop through the file names and apply the same transformations to each dataset
for (file_name in file_names) {
# Read the data
acus_data <- read_delim(file_name, delim = ";")
# Apply transformations
acus_data_clean <- acus_data %>%
select(date = Fecha,
station_code = NMT,
station_name = Nombre,
avg_24_hours = LAeq24) %>%
# Adjust variable types, recode "N/D" entries to NA and replace decimal commas with points
mutate(station_code = as.factor(station_code),
avg_24_hours = ifelse(avg_24_hours == "N/D", NA, avg_24_hours),
avg_24_hours = as.numeric(gsub(",", ".", avg_24_hours)))
# Append the cleaned data to the 'all_acus_data_clean' dataframe
all_acus_data_clean <- bind_rows(all_acus_data_clean, acus_data_clean)
}
# Create a vector of month abbreviations in the correct order
month_abbreviations <- c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic")
# Additional transformations
all_acus_data_clean <- all_acus_data_clean %>%
# Split the date column into month and year columns
mutate(month = as.numeric(match(tolower(substr(date, 1, 3)), tolower(month_abbreviations))),
year = as.numeric(paste0("20", substr(date, 5, 6)))) %>%
# Remove the original date column
select(-date) %>%
# Remove missing values
drop_na(avg_24_hours)
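The date parsing above assumes raw values of the form "mmm-yy" (e.g. "ene-17"); this format is inferred from the substr() positions used in the code. A toy check of the matching logic:

```r
# Toy check of the month/year extraction (raw date format inferred from
# the substr() positions used above)
month_abbreviations <- c("ene", "feb", "mar", "abr", "may", "jun",
                         "jul", "ago", "sep", "oct", "nov", "dic")
d <- "ene-17"
m <- match(tolower(substr(d, 1, 3)), month_abbreviations)
y <- as.numeric(paste0("20", substr(d, 5, 6)))
c(month = m, year = y)
# month 1, year 2017
```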
For years 2015 and 2016, data is only available in XLS format (with data for each month stored in separate sheets of the same workbook). We hence make use of Excel VBA to create a CSV file out of each of these sheets and set up another loop to read them. We start with 2016 data and apply similar transformations as we did above.
# Create a vector of Spanish month names in calendar order
months <- c("Enero", "Febrero", "Marzo", "Abril", "Mayo", "Junio",
"Julio", "Agosto", "Septiembre", "Octubre", "Noviembre", "Diciembre")
# Create an empty dataframe to store the cleaned data
all_acus_data_clean_2016 <- data.frame()
# Loop through the months from January to December
for (month in months) {
# Construct the file name
file_name <- paste0("2016_", month, "_acus.csv")
# Read the data
acus_data_2016 <- read_csv(file_name)
# Apply transformations
acus_data_clean_2016 <- acus_data_2016 %>%
select(station_code = NMT,
station_name = Nombre,
avg_24_hours = LAeq24) %>%
# Adjust variable types, recode "N/D" entries to NA and replace decimal commas with points
mutate(station_code = as.factor(station_code),
avg_24_hours = ifelse(avg_24_hours == "N/D", NA, avg_24_hours),
avg_24_hours = as.numeric(gsub(",", ".", avg_24_hours)),
# Add year and month columns
year = as.numeric(2016),
month = as.numeric(match(tolower(month), tolower(months))))
# Append the cleaned data to the 'all_acus_data_clean_2016' dataframe
all_acus_data_clean_2016 <- bind_rows(all_acus_data_clean_2016, acus_data_clean_2016)
}
Unfortunately, there is no available data for the months of June and July of 2015, forcing us to continue our analysis while acknowledging the non-random missingness of data during this period. We load the data and transform it:
# Create a vector of the month names available in 2015 (June and July missing)
months <- c("Enero", "Febrero", "Marzo", "Abril", "Mayo", "Agosto",
"Septiembre", "Octubre", "Noviembre", "Diciembre")
# Create an empty dataframe to store the cleaned data
all_acus_data_clean_2015 <- data.frame()
# Loop through the available months (June and July excluded)
for (month in months) {
# Construct the file name
file_name <- paste0("2015_", month, "_acus.csv")
# Read the data
acus_data_2015 <- read_csv(file_name)
# Apply transformations
acus_data_clean_2015 <- acus_data_2015 %>%
select(station_code = NMT,
station_name = Nombre,
avg_24_hours = LAeq24) %>%
# Adjust variable types, recode "N/D" entries to NA and replace decimal commas with points
mutate(station_code = as.factor(station_code),
avg_24_hours = ifelse(avg_24_hours == "N/D", NA, avg_24_hours),
avg_24_hours = as.numeric(gsub(",", ".", avg_24_hours)),
year = 2015,
# Match against all twelve Spanish month names so that August onwards keeps
# its correct calendar number despite June and July being absent from 'months'
month = match(month, c("Enero", "Febrero", "Marzo", "Abril", "Mayo", "Junio",
"Julio", "Agosto", "Septiembre", "Octubre", "Noviembre", "Diciembre")))
# Append the cleaned data to the 'all_acus_data_clean_2015' dataframe
all_acus_data_clean_2015 <- bind_rows(all_acus_data_clean_2015, acus_data_clean_2015)
}
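One caveat with the 2015 loop: month numbers must be derived from the full twelve-month sequence, because matching against the ten-month loop vector would mislabel every month from August onwards:

```r
# Matching against the truncated vector misnumbers months once June and
# July are skipped; the full vector gives the correct calendar position
full_months  <- c("Enero", "Febrero", "Marzo", "Abril", "Mayo", "Junio",
                  "Julio", "Agosto", "Septiembre", "Octubre", "Noviembre", "Diciembre")
short_months <- c("Enero", "Febrero", "Marzo", "Abril", "Mayo", "Agosto",
                  "Septiembre", "Octubre", "Noviembre", "Diciembre")
match("Agosto", short_months)  # 6 -- wrong month number
match("Agosto", full_months)   # 8 -- correct month number
```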
We merge all three datasets into one with monthly observations from 2015 to 2021.
# Bind together observations of all years
full_data_acus <- bind_rows(all_acus_data_clean, all_acus_data_clean_2016, all_acus_data_clean_2015)
Data for noise pollution monitoring stations inside the City of Madrid
Just like we did for the pollution monitoring network in Madrid, we download information about the location (address and coordinates) and unique identifiers of these noise monitoring stations. There are 31 in total, some of which share a location with pollution monitoring stations (for instance the policy centroid, i.e. Plaza del Carmen) and some of which do not.
# Read the data
acus_stations <- read_delim("EstacionesMedidaControlAcustico.csv", delim =";")
# Select relevant columns and extract number of station code
acus_stations_clean <- acus_stations %>%
select(station_code = NMT,
station_name = NOMBRE,
address = DIRECCION,
longitude = LONGITUD,
latitude = LATITUD) %>%
mutate(station_code = as.factor(gsub("\\D", "", station_code)),
longitude = as.numeric(longitude),
latitude = as.numeric(latitude))
Once again, we extract the coordinates of the Plaza del Carmen station and compute the distance of all 31 noise monitoring stations from it.
This time we create three different rings (as we are only considering stations within the City which do not extend past 15 km of the centroid), while preserving the range for the first three rings (i.e. both pollution and noise stations within 2.5 km are placed into Ring 1, those between 2.5 and 8 km into Ring 2, etc.). Ring 1 now contains 6 stations, compared to the 4 it had in the ring specification of the primary pollution regression.
# Extract latitude and longitude of Plaza del Carmen
plaza_carmen_longitude <- acus_stations_clean$longitude[acus_stations_clean$station_name == "Plaza del Carmen"]
plaza_carmen_latitude <- acus_stations_clean$latitude[acus_stations_clean$station_name == "Plaza del Carmen"]
# Calculate the distance between each station and Plaza del Carmen
acus_stations_clean <- acus_stations_clean %>%
mutate(distance_km = round(distHaversine(cbind(longitude, latitude),
c(plaza_carmen_longitude, plaza_carmen_latitude)) / 1000, 2)) %>%
arrange(distance_km)
# Specify cutoffs for 'distance_km' and assign ring labels
acus_stations_clean$ring <- cut(acus_stations_clean$distance_km, breaks = c(-Inf, 2.5, 8, Inf),
labels = c("Ring 1: Treated",
"Ring 2: Treated",
"Ring 3: Control"))
# Display number of stations in each ring
acus_stations_clean %>% group_by(ring) %>% count()
## # A tibble: 3 × 2
## # Groups: ring [3]
## ring n
## <fct> <int>
## 1 Ring 1: Treated 6
## 2 Ring 2: Treated 15
## 3 Ring 3: Control 10
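Note that cut() produces right-closed intervals by default, so a station exactly 2.5 km from the centroid falls into Ring 1. A toy check with hypothetical distances:

```r
# Toy distances (km) illustrating the ring cutoffs; intervals are
# right-closed, so 2.5 km still belongs to Ring 1
d <- c(0.4, 2.5, 2.51, 7.9, 12.0)
rings <- cut(d, breaks = c(-Inf, 2.5, 8, Inf),
             labels = c("Ring 1: Treated", "Ring 2: Treated", "Ring 3: Control"))
rings
# 0.4 and 2.5 fall into Ring 1; 2.51 and 7.9 into Ring 2; 12.0 into Ring 3
```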
Data for weather stations in the Community of Madrid
We reproduce the strategy we used to match each pollution station with the closest weather station, only now we use the coordinates of the stations in the noise monitoring network of the City Council.
In theory meteorological conditions should not have a strong effect on noise levels. However, to ensure the comparability of our primary and placebo regressions, we will include weather covariates in the following model.
# Access data on weather stations and make transformations
weather_stations_r <- aemet_stations(verbose = FALSE, return_sf = FALSE) %>%
filter(provincia == "MADRID") %>%
select(-c(indicativo, indsinop, provincia)) %>%
rename(altitude = altitud,
longitude = longitud,
latitude = latitud,
weather_station_name = nombre)
# Create a matrix of coordinates for noise stations
coords_noise <- acus_stations_clean[, c("longitude", "latitude")]
# Create a matrix of coordinates for weather stations
coords_weather_r <- weather_stations_r[, c("longitude", "latitude")]
# Calculate distances between each noise station and each weather station
distances <- distm(coords_noise, coords_weather_r, fun = distHaversine)
# Find the index of the closest weather station for each noise station
closest_weather_index <- apply(distances, 1, which.min)
# Match each noise station to the closest weather station
acus_stations_clean$closest_weather_station <- weather_stations_r$weather_station_name[closest_weather_index]
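The matching step amounts to taking the row-wise minimum of the distance matrix; a toy base-R illustration with hypothetical distances and station names:

```r
# Rows = noise stations, columns = weather stations (toy distances in km)
dist_mat <- matrix(c(1.2, 0.4, 3.0,
                     2.5, 2.2, 0.7),
                   nrow = 2, byrow = TRUE)
weather_names <- c("W1", "W2", "W3")  # hypothetical weather station names
closest <- weather_names[apply(dist_mat, 1, which.min)]
closest
# the first noise station is matched to "W2", the second to "W3"
```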
Pre-modelling data aggregation and feature engineering
Now that we have all the data we need, we merge noise level averages with the information about each station (including rings and closest weather stations) and remove observations from July 2021 onwards to ensure our results are comparable to those of our main analysis.
# Merge data with monthly noise pollution averages and station data
model_pollution_acus <- full_data_acus %>%
left_join(acus_stations_clean, by = c("station_code", "station_name"))
# Remove data from July 2021 onwards
model_pollution_acus <- model_pollution_acus %>%
filter(year < 2021 | (year == 2021 & month < 7))
Next, we merge this new dataframe with the weather information.
# Rename weather station variable so that it is the same in both datasets
model_pollution_acus <- model_pollution_acus %>%
rename(weather_station_name = closest_weather_station)
# Merge pollution and weather data
did_df_acus <- model_pollution_acus %>%
left_join(weather_full_final_a,
by = c("year", "month", "weather_station_name"))
Just like we have done before, we create the post-treatment, treatment (only for Ring 1 since that is where we found significant effects) and the interaction term. We adjust variable types and discard irrelevant variables.
# Create post-treatment, treatment and interaction terms
did_df_acus <- did_df_acus %>% mutate(
# Create post-treatment dummy (implementation at the end of November 2018,
# so the post-treatment period starts in December 2018)
POST = ifelse(year > 2018 | (year == 2018 & month == 12), 1, 0),
# Create treatment dummy for ring 1
treatment_ring1 = ifelse(ring == "Ring 1: Treated", 1, 0),
# Create interaction terms by multiplying POST and treatment ring
POST_treat_1 = POST*as.numeric(treatment_ring1))
# Transform new variables into factors
did_df_acus <- did_df_acus %>%
mutate(across(any_of(c("POST", "treatment_ring1", "POST_treat_1")), as.factor))
# Remove variables
did_df_acus <- did_df_acus %>% select(-c(min_temperature_monthly, max_temperature_monthly, address))
# Convert month variables to strings
did_df_acus$month <- as.character(did_df_acus$month)
Modelling
Finally, we regress average monthly noise levels on the treatment dummies (with Ring 1 comprising Plaza del Carmen and the five closest stations) and the weather covariates. We compute clustered standard errors.
# Set up the fixed effects model
m_noise <- plm(avg_24_hours ~ POST + POST_treat_1 +
avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
wind_direction_monthly + avg_pressure_monthly +
month,
model = "within", index = "station_code",
data = did_df_acus)
# T-test with clustered standard errors
clustered_test_m_noise <- coeftest(m_noise, vcov = vcovHC(m_noise, type = "HC1", cluster = "group"))
print(clustered_test_m_noise)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## POST1 -230.68199 5.41079 -42.6337 < 2.2e-16 ***
## POST_treat_11 -0.13983 10.85402 -0.0129 0.989722
## avg_temperature_monthly -2.55793 1.12080 -2.2822 0.022567 *
## total_rain_monthly 0.26553 0.14242 1.8644 0.062396 .
## avg_wind_speed_monthly -103.09887 13.74835 -7.4990 9.149e-14 ***
## wind_direction_monthly 3.84028 0.92450 4.1539 3.389e-05 ***
## avg_pressure_monthly -2.72291 0.61979 -4.3933 1.168e-05 ***
## month10 8.92238 16.96038 0.5261 0.598889
## month11 66.67095 15.31216 4.3541 1.395e-05 ***
## month12 83.63105 14.92348 5.6040 2.347e-08 ***
## month2 44.12627 10.45118 4.2221 2.515e-05 ***
## month3 74.98101 13.11653 5.7165 1.229e-08 ***
## month4 48.69441 14.96806 3.2532 0.001158 **
## month5 77.66473 18.19179 4.2692 2.042e-05 ***
## month6 98.28343 20.33526 4.8332 1.433e-06 ***
## month7 96.64916 29.12647 3.3183 0.000920 ***
## month8 86.78405 22.89251 3.7909 0.000154 ***
## month9 69.11585 21.91805 3.1534 0.001635 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Calculate clustered standard errors to export results using `stargazer()`
se_m_noise <- se_clustered(m_noise)
# Export results
stargazer(m_noise,
se = list(se_m_noise),
title = "Regression Results",
align = TRUE,
type = "text",
dep.var.caption = "Dependent Variable: Average Monthly Noise Level",
dep.var.labels.include = FALSE,
column.labels = c("Model 2 (Placebo Outcome)"),
omit = exclude_vars,
covariate.labels = variable_labels_r)
##
## Regression Results
## ==============================================================================
## Dependent Variable: Average Monthly Noise Level
## -----------------------------------------------
## Model 2 (Placebo Outcome)
## ------------------------------------------------------------------------------
## Post-Treatment -230.682***
## (5.411)
##
## Post-Treatment*Ring 1 -0.140
## (10.854)
##
## Average Monthly Temperature -2.558**
## (1.121)
##
## Total Monthly Rain 0.266*
## (0.142)
##
## Average Monthly Wind Speed -103.099***
## (13.748)
##
## Average Monthly Wind Direction 3.840***
## (0.924)
##
## Average Monthly Pressure -2.723***
## (0.620)
##
## ------------------------------------------------------------------------------
## Observations 2,333
## R2 0.264
## Adjusted R2 0.249
## F Statistic 45.605*** (df = 18; 2284)
## ==============================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
With a p-value of 0.99, the intervention is found to have no significant effect on the average noise level in Ring 1 stations. This increases the plausibility that the observed effects on air pollution are truly attributable to the LEZ, thereby strengthening the validity of our research design.
This section includes visualizations and tables belonging to the Appendix in the main document.
As we saw in Section 4.3, the assumption required for multiple imputation (MAR) might not be satisfied, as missingness appears to be associated with the station. To improve the reliability of the imputation procedure, we incorporate weights following the approach in Gomila and Clark (2020): observations with a higher probability of going missing receive larger weights, and observations with lower chances of missingness receive smaller weights (inverse probability weighting, IPW).
We create a separate dataframe to impute missing values using weights and perform the same transformations as before:
# Repeat loading and transforming of weather data
weather_full_final_imp <- weather_full %>%
select(date = fecha,
weather_station_name = nombre,
avg_temperature = tmed,
min_temperature = tmin,
max_temperature = tmax,
total_rain = prec,
direction_max_wind_speed = dir,
avg_wind_speed = velmedia,
min_pressure = presMin,
max_pressure = presMax) %>%
# Replace comma with point
mutate(total_rain = as.numeric(gsub(",", ".", total_rain)),
direction_max_wind_speed = as.numeric(direction_max_wind_speed),
# Obtain 'year' and 'month' columns from 'date' column
date = ymd(date),
year = year(date),
month = month(date)) %>%
select(-date)
# Replace all values of 88 (AEMET's code for variable wind direction) with NAs
weather_full_final_imp$direction_max_wind_speed <-
ifelse(weather_full_final_imp$direction_max_wind_speed == 88, NA, weather_full_final_imp$direction_max_wind_speed)
Again, we take the example of minimum pressure and create an indicator that takes a value of 1 for available observations and 0 for missing values. We then regress this indicator on the column capturing the weather station using logistic regression; since the indicator equals 1 when the value is observed, the fitted values are probabilities of being observed, and their inverses give the corresponding weights.
We then perform the imputation by weighted predictive mean matching (weighted PMM); weighted normal linear regression would be an alternative. We incorporate the imputed values into the original dataframe, ‘weather_full_final_imp’.
set.seed(1234)
# Example for minimum pressure
weather_full_final_imp <- weather_full_final_imp %>%
mutate(missing = as.numeric(!is.na(weather_full_final_imp$min_pressure)))
# Model of missingness of minimum pressure conditional on the weather station
m_missing <- glm(missing ~ weather_station_name,
family = binomial(link = "logit"),
data = weather_full_final_imp)
# Calculate the predicted probabilities of missingness
prob_missing <- m_missing$fitted
# Calculate the inverse of the predicted probabilities (weights)
gen_weights <- 1/prob_missing
# Perform multiple imputations using weighted PMM
imputed_data <- mice::mice(weather_full_final_imp,
method = "weighted.pmm",
imputationWeights = gen_weights)
# Update original dataframe with imputed values
weather_full_final_imp$min_pressure <- mice::complete(imputed_data)[, "min_pressure"]
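The weighting logic can be illustrated on simulated data (a toy example, not part of the analysis): units belonging to a group with more missingness end up with larger weights.

```r
set.seed(42)
# Two toy groups: A is observed ~90% of the time, B only ~50%
grp <- rep(c("A", "B"), each = 100)
observed <- c(rbinom(100, 1, 0.9), rbinom(100, 1, 0.5))
# Model the probability of being observed, as in the min_pressure example
fit <- glm(observed ~ grp, family = binomial(link = "logit"))
# Inverse of the fitted probabilities gives the weights
w <- 1 / fit$fitted
tapply(w, grp, mean)  # group B receives larger average weights than group A
```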
Before we look at the imputed distribution, we confirm that it actually differs from the distribution obtained with ordinary (unweighted) PMM:
# Check the imputed distributions from PMM and PMM with weights are different
are_identical <- identical(weather_full_final$min_pressure, weather_full_final_imp$min_pressure)
print(are_identical)
## [1] FALSE
Below we see that the five-number summary and distribution of minimum pressure after imputing by PMM with weights is very similar to both the original and the unweighted PMM-imputed one. Hence, we retain the PMM imputation not involving weights for all our weather variables in the model.
# Print 5-number summary for 'min_pressure'
summary(weather_full_final_imp$min_pressure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 780.4 892.4 931.5 909.2 943.1 971.9
# Create histogram for the distribution of 'min_pressure' using weights
ggplot(weather_full_final_imp, aes(x = min_pressure)) +
geom_histogram(binwidth = 4, color = "black", fill = "lightpink", position ="identity") +
labs(x = "Values",
y = "Frequency",
title = "PMM-Imputed Distribution of Minimum Pressure with Weights") +
theme_light() +
theme(plot.title = element_text(size = 12))
To help the reader understand the rules governing which cars may enter Madrid Central according to their emissions label, we create the following table:
# Define variable names
labels <- c("No label", "B", "C", "ECO", "ZERO")
# Define descriptions
descriptions <- c("Unclassified gasoline and diesel vehicles",
"Gasoline vehicles registered after 2000 and diesel vehicles registered after 2006",
"Gasoline vehicles registered after 2006 and diesel vehicles registered after 2014",
"Plug-in hybrid vehicles with a range below 40 km and gas vehicles",
"Electric (100% and range-extended) and (plug-in) hybrid vehicles with a range over 40 km")
# Define access rules
access_rules <- c("Never",
"7 – 13 hours (off-street parking spaces)",
"7 – 17 hours (off-street parking spaces)",
"7 – 24 hours (on-street parking spaces)",
"24 hours")
# Create dataframe with variables names, descriptions and access rules
MC_description_df <- data.frame(labels, descriptions, access_rules)
# Convert 'labels' to a factor variable with the desired order of levels
MC_description_df$labels <- factor(MC_description_df$labels, levels = labels, ordered = TRUE)
# Rename columns in the final dataframe
names(MC_description_df) <- c("Label", "Description", "Access to Madrid Central LEZ*")
# Format table
kable(MC_description_df, format = "html", row.names = FALSE) %>%
kable_styling(c("striped", "hover", "bordered"), full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| Label | Description | Access to Madrid Central LEZ* |
|---|---|---|
| No label | Unclassified gasoline and diesel vehicles | Never |
| B | Gasoline vehicles registered after 2000 and diesel vehicles registered after 2006 | 7 – 13 hours (off-street parking spaces) |
| C | Gasoline vehicles registered after 2006 and diesel vehicles registered after 2014 | 7 – 17 hours (off-street parking spaces) |
| ECO | Plug-in hybrid vehicles with a range below 40 km and gas vehicles | 7 – 24 hours (on-street parking spaces) |
| ZERO | Electric (100% and range-extended) and (plug-in) hybrid vehicles with a range over 40 km | 24 hours |
Similarly, we create a timeline guiding the user through the different phases of the implementation of Madrid Central for the time window under consideration, using the vistime() function.
# Create dataframe with information about event (start and end date)
timeline_data <- data.frame(event = c("Transition",
"Sanctions",
"",
"Application with sanctions (slight rule change)"),
start = c("2018-11-30",
"2019-03-16",
"2019-07-01",
"2019-07-09"),
end = c("2019-03-15",
"2019-06-30",
"2019-07-08",
"2021-06-30"))
# Create static timeline with labels
vistime(timeline_data, optimize_y = TRUE, title = "Timeline of the Implementation of Madrid Central")
We also create an interactive version using hc_vistime() from the highcharter package, with slightly more detailed information compared to the static timeline:
# Create dataframe with information about event (start and end date)
timeline_data <- data.frame(event = c("Transitionary Period I: no cameras",
"Transitionary Period II: monitoring system without sanctions",
"Application with sanctions",
"Suspension",
"Application with sanctions: slight rule change"),
start = c("2018-11-30",
"2019-01-01",
"2019-03-16",
"2019-07-01",
"2019-07-09"),
end = c("2019-01-01",
"2019-03-15",
"2019-07-01",
"2019-07-08",
"2021-06-30"))
# Create interactive timeline without labels
hc_vistime(timeline_data, show_labels = FALSE, title = "Timeline of the Implementation of Madrid Central")
Finally, we create a table with a brief explanation of the changes taking place during the dates included in the timelines.
# Define dates
dates <- c("30th November, 2018",
"1st January, 2019",
"15th March, 2019",
"1st - 7th July, 2019",
"1st January, 2020",
"July, 2021")
# Define events
event <- c("Official day of entry into force of Madrid Central",
"Activation of camera monitoring system",
"Beginning of sanctions",
"Temporary suspension of LEZ sanctions",
"Introduction of slightly modified rules",
"Approval of 'Madrid 360'")
# Define event descriptions
descriptions <- c("First phase of the transition period. No active cameras, with police officers monitoring and enforcing rules.",
"Automatic monitoring system with cameras at the LEZ perimeter. Warnings for non-compliance sent by post.",
"Infractions first subject to fines ranging up to 90 euros, or 45 euros with early payment.",
"Suspension of sanctions under the new mayor of Madrid, followed by reinstatement by Court order.",
"Parking ban on no-label vehicles and two streets released from restrictions (Mártires de Alcalá and Seminario de Nobles).",
"'Madrid 360' replaces Madrid Central. Changes include the replacement of Madrid Central by the 'Zona de Bajas Emisiones de Especial Protección Distrito Centro', a new LEZ in Plaza Elíptica and the provision of banning entrance to no-label vehicles into the LEZ starting in January 2022.")
# Create dataframe with dates, events and event descriptions
timeline_df <- data.frame(dates, event, descriptions)
# Convert 'dates' to a factor variable with the desired order of levels
timeline_df$dates <- factor(timeline_df$dates, levels = dates, ordered = TRUE)
# Rename columns in the final dataframe
names(timeline_df) <- c("Date", "Event", "Event Description")
# Format table
kable(timeline_df, format = "html", row.names = FALSE) %>%
kable_styling(c("striped", "hover", "bordered"), full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| Date | Event | Event Description |
|---|---|---|
| 30th November, 2018 | Official day of entry into force of Madrid Central | First phase of the transition period. No active cameras, with police officers monitoring and enforcing rules. |
| 1st January, 2019 | Activation of camera monitoring system | Automatic monitoring system with cameras at the LEZ perimeter. Warnings for non-compliance sent by post. |
| 15th March, 2019 | Beginning of sanctions | Infractions first subject to fines ranging up to 90 euros, or 45 euros with early payment. |
| 1st - 7th July, 2019 | Temporary suspension of LEZ sanctions | Suspension of sanctions under the new mayor of Madrid, followed by reinstatement by Court order. |
| 1st January, 2020 | Introduction of slightly modified rules | Parking ban on no-label vehicles and two streets released from restrictions (Mártires de Alcalá and Seminario de Nobles). |
| July, 2021 | Approval of ‘Madrid 360’ | ‘Madrid 360’ replaces Madrid Central. Changes include the replacement of Madrid Central by the ‘Zona de Bajas Emisiones de Especial Protección Distrito Centro’, a new LEZ in Plaza Elíptica and the provision of banning entrance to no-label vehicles into the LEZ starting in January 2022. |
Gomila, R., and C. S. Clark. 2020. “Missing Data in Experiments: Challenges and Solutions”. American Psychological Association, 27(2): 143–155.
Pearl, J. 2009. Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge: Cambridge University Press.
Rubin, D. B. 1976. “Inference and Missing Data”. Biometrika, 63(3): 581–592.
Snow, J. 1855. On the Mode of Communication of Cholera. London: John Churchill.